A Comparative Study of IRT Models for Rater Effects and Double Scoring