Persian Ezafe Recognition Using Neural Approaches
Subject areas: Natural Language Processing
Habibollah Asghari 1, Heshaam Faili 2
1 - ICT Research Institute, ACECR, Tehran, Iran
2 - Department of ECE, School of Engineering, University of Tehran, Tehran, Iran
Keywords: Persian Ezafe Recognition, Vowel Restoration, Diacritization, Neural Sequence Labeling
Abstract:
Persian Ezafe recognition aims to automatically identify occurrences of the Ezafe (the short vowel /e/), which is pronounced but usually not represented orthographically. The task is similar to diacritization and vowel restoration in Arabic. Ezafe recognition can be used for spelling disambiguation in text-to-speech (TTS) systems and in other language processing tasks such as syntactic parsing and semantic role labeling. In this paper, we propose two neural approaches to the automatic recognition of Ezafe markers in Persian text: a neural sequence labeling method and a neural machine translation (NMT) approach. We also propose syntactic features to be exploited in the neural models, and we apply various combinations of lexical features to the models, such as word forms, part-of-speech tags, and the ending letters of words. These features were statistically derived from a large annotated Persian text corpus and optimized by a forward selection method. To evaluate our approaches, we examined nine baseline models, including state-of-the-art approaches to Ezafe recognition in Persian text. Our experiments show that the neural approaches, using the optimized features, improve drastically on the baselines and also outperform the Conditional Random Field method, the best-performing baseline. The NMT approach performs better than the baseline approaches but does not match the neural sequence labeling method. The best F1-measure, achieved by neural sequence labeling, is 96.29%.
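To make the sequence labeling framing concrete, the following is a minimal Python sketch, not the authors' implementation. It assumes each token receives one of two labels, and it extracts the three lexical features named in the abstract (word form, part-of-speech tag, ending letter) before scoring predictions with an F1-measure on the Ezafe label. The label names `EZ` and `O`, the POS tag names, the feature keys, and the function names are all hypothetical choices made for illustration.

```python
from typing import Dict, List, Sequence, Tuple


def token_features(word: str, pos_tag: str) -> Dict[str, str]:
    """Lexical features for one token: word form, POS tag, ending letter.

    The feature keys are hypothetical; the abstract names only the
    feature types, not how they are encoded.
    """
    return {
        "word": word,
        "pos": pos_tag,
        "last_letter": word[-1] if word else "",
    }


def sentence_features(
    words: Sequence[str], pos_tags: Sequence[str]
) -> List[Dict[str, str]]:
    """Build one feature dictionary per token of a sentence."""
    if len(words) != len(pos_tags):
        raise ValueError("words and pos_tags must have the same length")
    return [token_features(w, t) for w, t in zip(words, pos_tags)]


def precision_recall_f1(
    gold: Sequence[str], pred: Sequence[str], label: str = "EZ"
) -> Tuple[float, float, float]:
    """Precision, recall, and F1 for a single label (here the Ezafe label)."""
    if len(gold) != len(pred):
        raise ValueError("gold and pred must have the same length")
    tp = sum(1 for g, p in zip(gold, pred) if g == label and p == label)
    fp = sum(1 for g, p in zip(gold, pred) if g != label and p == label)
    fn = sum(1 for g, p in zip(gold, pred) if g == label and p != label)
    precision = tp / (tp + fp) if tp + fp else 0.0
    recall = tp / (tp + fn) if tp + fn else 0.0
    total = precision + recall
    f1 = 2 * precision * recall / total if total else 0.0
    return precision, recall, f1


# "ketab-e khub-e man" (my good book): the first two words carry an Ezafe.
words = ["کتاب", "خوب", "من"]
pos_tags = ["N", "ADJ", "PRO"]  # hypothetical tag names
gold = ["EZ", "EZ", "O"]
pred = ["EZ", "O", "O"]  # made-up output of an imaginary tagger

features = sentence_features(words, pos_tags)
precision, recall, f1 = precision_recall_f1(gold, pred)
```

In this toy example the imaginary tagger misses the Ezafe on the second word, so precision is perfect while recall is halved. A neural tagger would consume the per-token feature dictionaries as embedded inputs and emit the label sequence; that part is omitted here because the abstract does not specify the architecture.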