Noor Analysis: A Benchmark Dataset for Evaluating Morphological Analysis Engines
Subject Areas : ICT
Huda Al-Shohayyeb 1, Behrooz Minaei 2, Mohammad Ebrahim Shenassa 3, Sayyed Ali Hossayni 4
1 - PhD Student
Keywords: Morphology, Arabic Language, Annotation, Dataset, Morphological Analysis
Abstract :
The Arabic language has a very rich and complex morphology, and morphological information is very useful for analyzing Arabic, especially traditional texts such as historical and religious works, where it helps in understanding the meaning of the text. In a morphological dataset, the variety of labels and the number of data samples both help in evaluating morphological methods. In this research, we present a morphological dataset of about 223,690 words from the book Sharia al-Islam, all labeled by experts. It is the largest such dataset in terms of volume, and its variety of labels is superior to that of other data provided for Arabic morphological analysis. To evaluate the data, we applied the Farasa system to the texts, and we report the annotation quality through four evaluations on the Farasa system.