Document Type: Research article


Zanjan Branch, Islamic Azad University, Zanjan, Iran


Because oral language proficiency is scored by human raters, raters are an essential part of performance assessment. One important rater characteristic that has attracted considerable attention is teaching and rating experience. In most previous studies of rater training, extremely severe or lenient raters benefited more from training programs, and training significantly reduced the severity or leniency of their rating behavior. However, most of these studies applied FACETS to only one or two facets, and few used a pre- and post-training design. Moreover, empirical studies have reported conflicting outcomes and do not show clearly which group of raters rates more reliably. In this study, 20 experienced and inexperienced raters rated the oral performances of 200 test-takers before and after a training program. The results indicated that training leads to higher interrater consistency and reduces bias in the use of rating scale categories. Moreover, since rater variability is almost impossible to eradicate completely even with training, rater training is better regarded as a procedure for making raters more self-consistent (intrarater reliability) than for making them consistent with each other (interrater reliability). The findings indicated that the rating quality of both inexperienced and experienced raters improved after training; however, inexperienced raters showed higher consistency and less bias. Hence, there is no evidence that inexperienced raters should be excluded from rating solely because they lack experience. In addition, inexperienced raters cost decision-makers less to employ than experienced ones. Therefore, instead of spending a large budget on experienced raters, decision-makers would do better to invest in establishing better training programs.

