Document Type: Research article


Assistant Professor, Faculty of Linguistics, Institute for the Humanities and Cultural Studies, Tehran, Iran;


Predicting a learner's proficiency level is one subfield of language proficiency assessment.
This research proposes a computational linguistic model to predict language proficiency level and to explore the general properties of the levels. To this end, a corpus was developed from textbooks for learners of Persian, and statistical and linguistic features were extracted from this corpus to train three classifiers (learners). The performance of the models varies with the learning algorithm and the feature set(s) used for training. The models were evaluated with four standard metrics: accuracy, precision, recall, and F-measure.
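As an illustration of the four evaluation metrics named above, the following is a minimal sketch in Python. It assumes macro-averaging over the proficiency levels, which the abstract does not specify, and the function name `evaluate` and the toy labels are hypothetical rather than taken from the study.

```python
from collections import Counter


def evaluate(gold, predicted):
    """Return accuracy and macro-averaged precision, recall, and F-measure.

    Macro-averaging (an unweighted mean over classes) is an assumption made
    for this sketch; the averaging scheme is not stated in the abstract.
    """
    if not gold or len(gold) != len(predicted):
        raise ValueError("gold and predicted must be non-empty and equal in length")

    labels = sorted(set(gold) | set(predicted))
    tp, fp, fn = Counter(), Counter(), Counter()
    for g, p in zip(gold, predicted):
        if g == p:
            tp[g] += 1          # correct prediction for class g
        else:
            fp[p] += 1          # p was predicted but is wrong
            fn[g] += 1          # g was the true class but was missed

    def ratio(num, den):
        # Return 0.0 when a class has no predictions or no gold items.
        return num / den if den else 0.0

    precisions = [ratio(tp[c], tp[c] + fp[c]) for c in labels]
    recalls = [ratio(tp[c], tp[c] + fn[c]) for c in labels]
    f_scores = [ratio(2 * p * r, p + r) for p, r in zip(precisions, recalls)]

    n = len(labels)
    return {
        "accuracy": sum(tp.values()) / len(gold),
        "precision": sum(precisions) / n,
        "recall": sum(recalls) / n,
        "f_measure": sum(f_scores) / n,
    }


# Toy example with three proficiency levels (illustrative labels only).
gold = ["A", "A", "B", "B", "C", "C"]
predicted = ["A", "B", "B", "B", "C", "A"]
scores = evaluate(gold, predicted)
```

Here the F-measure is computed per level as the harmonic mean of precision and recall and then averaged, so a level that is rarely predicted correctly lowers the overall score even if it is small.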
According to the results, the model created by the Random Forest classifier performed best when statistical features extracted from raw text were used, whereas the Support Vector Machine classifier performed best with linguistic features extracted from the automatically annotated corpus. The results indicate that enriching the model with various kinds of information does not guarantee that a classifier (learner) performs best.
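To make the notion of statistical features extracted from raw text concrete, the sketch below computes three common surface measures. The abstract does not list the features actually used, so the feature set, the function name `surface_features`, and the tokenization are assumptions made for illustration only.

```python
import re


def surface_features(text):
    """Compute simple surface statistics from raw, unannotated text.

    The three features here are illustrative assumptions, not the study's
    feature set. Tokenization is a plain regular expression; real Persian
    text would need proper handling of the zero-width non-joiner, which
    this pattern treats as a word boundary.
    """
    # Split on Latin and Persian sentence-final punctuation.
    sentences = [s for s in re.split(r"[.!?؟]+", text) if s.strip()]
    tokens = re.findall(r"\w+", text)
    if not sentences or not tokens:
        return {"avg_sentence_length": 0.0, "avg_word_length": 0.0,
                "type_token_ratio": 0.0}

    types = {t.lower() for t in tokens}
    return {
        # Mean number of tokens per sentence.
        "avg_sentence_length": len(tokens) / len(sentences),
        # Mean number of characters per token.
        "avg_word_length": sum(len(t) for t in tokens) / len(tokens),
        # Distinct word forms divided by total tokens (lexical variety).
        "type_token_ratio": len(types) / len(tokens),
    }


features = surface_features("The cat sat. The dog ran fast!")
```

A vector of such values per text could then be passed to any of the classifiers as input. Because these measures need no annotation, they are cheap to compute compared with features that depend on tagging or parsing.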

To uncover the latent teaching methodology of the textbooks, the general performance of the classifiers was studied with respect to the language level and the linguistic knowledge used to build each model. The results show that the number of extracted features plays an important role in training a classifier. Furthermore, the best average performance of the classifiers is obtained as the linguistic knowledge is extended from syntactic patterns at proficiency level A (beginner) to all linguistic information at levels B (intermediate) and C (advanced).

