Ahmadi, A., & Sadeghi, E. (2016). Assessing English language learners’ oral performance: A comparison of monologue, interview, and group oral test. Language Assessment Quarterly, 13(4), 341-358.
Attali, Y. (2016). A comparison of newly-trained and experienced raters on a standardized writing assessment. Language Testing, 33(1), 99-115.
Barkaoui, K. (2011). Think-aloud protocols in research on essay rating: An empirical study on their veridicality and reactivity. Language Testing, 28(1), 51-75.
Barrett, S. (2001). The impact of training on rater variability. International Education Journal, 2(1), 49-58.
Bijani, H. (2010). Raters’ perception and expertise in evaluating second language compositions. The Journal of Applied Linguistics, 3(2), 69-89.
Bijani, H., & Fahim, M. (2011). The effects of rater training on raters’ severity and bias analysis in second language writing. Iranian Journal of Language Testing, 1(1), 1-16.
Bonk, W. J., & Ockey, G. J. (2003). A many-facet Rasch analysis of the second language group oral discussion task. Language Testing, 20(1), 89-110.
Caban, H. L. (2003). Rater group bias in speaking assessment of four L1 Japanese ESL students. Second Language Studies, 21(1), 1-44.
Cohen, L., Manion, L., & Morrison, K. (2007). Research methods in education. London, England: Routledge.
Cumming, A. (1990). Expertise in evaluating second language compositions. Language Testing, 7(1), 31-51.
Davis, L. (2009). The influence of interlocutor proficiency in a paired oral assessment. Language Testing, 26(3), 367-396.
Davis, L. (2016). The influence of training and experience on rater performance in scoring spoken language. Language Testing, 33(1), 117-135.
Eckes, T. (2015). Introduction to many-facet Rasch measurement. Frankfurt, Germany: Peter Lang Edition.
Educational Testing Service. (2001). ETS oral proficiency testing manual. Princeton, NJ: Author.
Gan, Z. (2010). Interaction in group oral assessment: A case study of higher- and lower-scoring students. Language Testing, 27(4), 585-602.
Huang, H., Huang, S., & Hong, H. (2016). Test-taker characteristics and integrated speaking test performance: A path-analytic study. Language Assessment Quarterly, 13(4), 283-301.
In’nami, Y., & Koizumi, R. (2016). Task and rater effects in L2 speaking and writing: A synthesis of generalizability studies. Language Testing, 33(3), 341-366.
Khabbazbashi, N. (2017). Topic and background knowledge effects on performance in speaking assessment. Language Testing, 34(1), 23-48.
Kim, H. J. (2011). Investigating raters’ development of rating ability on a second language speaking assessment (Unpublished doctoral dissertation). Columbia University, New York.
Kim, H. J. (2015). A qualitative analysis of rater behavior on an L2 speaking assessment. Language Assessment Quarterly, 12(3), 239-261.
Kondo-Brown, K. (2002). A FACETS analysis of rater bias in measuring Japanese second language writing performance. Language Testing, 19(1), 3-31.
Kuiken, F., & Vedder, I. (2014). Raters’ decisions, rating procedures and rating scales. Language Testing, 31(3), 279-284.
Kyle, K., Crossley, S. A., & McNamara, D. S. (2016). Construct validity in TOEFL iBT speaking tasks: Insights from natural language processing. Language Testing, 33(3), 319-340.
Leaper, D. A., & Riazi, M. (2014). The influence of prompt on group oral tests. Language Testing, 31(2), 177-204.
Lim, G. S. (2011). The development and maintenance of rating quality in performance writing assessment: A longitudinal study of new and experienced raters. Language Testing, 28(4), 543-560.
Linacre, J. M. (1989). Many-facet Rasch measurement. Chicago, IL: MESA Press.
McNamara, T. F. (1996). Measuring second language performance. London, England: Longman.
McNamara, T. F., & Lumley, T. (1997). The effect of interlocutor and assessment mode variables in overseas assessments of speaking skills in occupational settings. Language Testing, 14(2), 140-156.
Nakatsuhara, F. (2011). Effect of test-taker characteristics and the number of participants in group oral tests. Language Testing, 28(4), 483-508.
Steiger, J. H. (1980). Tests for comparing elements of a correlation matrix. Psychological Bulletin, 87(2), 245-251.
Van Moere, A. (2012). A psycholinguistic approach to oral language assessment. Language Testing, 29(3), 325-344.
Winke, P., Gass, S., & Myford, C. (2012). Raters’ L2 background as a potential source of bias in rating oral performance. Language Testing, 30(2), 231-252.
Wright, B. D., & Linacre, J. M. (1994). Reasonable mean-square fit values. Rasch Measurement Transactions, 8(3), 369-386.