Application of Computational Linguistics to Predicting Language Proficiency Level of Persian Learners’ Textbooks

Ghayoomi, Masood

doi:10.22051/lghor.2021.32656.1354

Document Type : Research article

Author

Masood Ghayoomi

Assistant Professor, Faculty of Linguistics, Institute for the Humanities and Cultural Studies, Tehran, Iran

https://doi.org/10.22051/lghor.2021.32656.1354

Abstract

Predicting language proficiency level is one subfield of language proficiency assessment.
This research proposes a computational linguistic model to predict language proficiency level and to explore the general properties of the levels. To this end, a corpus is developed from Persian learners' textbooks, and statistical and linguistic features are extracted from this corpus to train three classifiers as learners. The performance of the models varies with the learning algorithm and the feature set(s) used for training. To evaluate the models, four standard metrics are used, namely accuracy, precision, recall, and F-measure.
Based on the results, the model created by the Random Forest classifier performed best when statistical features extracted from raw text were used. The Support Vector Machine classifier performed best when using linguistic features extracted from the automatically annotated corpus. The results indicate that enriching the model and providing various kinds of information do not guarantee that a classifier (learner) performs best.
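
As an illustration of the four evaluation metrics named above, the following minimal sketch computes accuracy, precision, recall, and F-measure for a three-level classification task. It is not the author's implementation: the function name `evaluate`, the sample labels, and the choice of macro-averaging across the levels A, B, and C are all assumptions made for this example, since the abstract does not state how the per-level scores are aggregated.

```python
from collections import Counter


def evaluate(gold, predicted):
    """Return accuracy and macro-averaged precision, recall, and F-measure.

    Macro-averaging (an unweighted mean over levels) is an assumption of
    this sketch; the abstract does not specify the averaging scheme.
    """
    if len(gold) != len(predicted) or not gold:
        raise ValueError("gold and predicted must be non-empty and equal in length")

    labels = sorted(set(gold) | set(predicted))
    tp, fp, fn = Counter(), Counter(), Counter()
    for g, p in zip(gold, predicted):
        if g == p:
            tp[g] += 1          # correct prediction for level g
        else:
            fp[p] += 1          # level p was predicted wrongly
            fn[g] += 1          # level g was missed

    precisions, recalls, f_scores = [], [], []
    for label in labels:
        p_den = tp[label] + fp[label]
        r_den = tp[label] + fn[label]
        precision = tp[label] / p_den if p_den else 0.0
        recall = tp[label] / r_den if r_den else 0.0
        pr_sum = precision + recall
        f_score = 2 * precision * recall / pr_sum if pr_sum else 0.0
        precisions.append(precision)
        recalls.append(recall)
        f_scores.append(f_score)

    n = len(labels)
    return {
        "accuracy": sum(tp.values()) / len(gold),
        "precision": sum(precisions) / n,
        "recall": sum(recalls) / n,
        "f_measure": sum(f_scores) / n,
    }


# Hypothetical gold and predicted proficiency levels for six text samples.
gold = ["A", "A", "B", "B", "C", "C"]
predicted = ["A", "B", "B", "B", "C", "A"]
scores = evaluate(gold, predicted)
for name, value in scores.items():
    print(f"{name}: {value:.3f}")
```

Reporting all four scores together matters because accuracy alone can hide a classifier that does well on one level and poorly on another, whereas the per-level precision and recall expose that imbalance.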

To discover the latent teaching methodology of the textbooks, the general performance of the classifiers is studied with respect to the language level and the linguistic knowledge used for creating the model. Based on the obtained results, the number of extracted features plays an important role in training a classifier. Furthermore, on average, the classifiers perform best when the linguistic knowledge is extended from syntactic patterns at proficiency level A (beginner) to all linguistic information at levels B (intermediate) and C (advanced).

Keywords

20.1001.1.2588350.2022.6.1.2.3


Journal of Language Horizons

Volume 6, Issue 1 - Serial Number 11
May 2022
Pages 29-52
