Noor Analysis: A Benchmark Dataset for Evaluating Morphological Tagging Methods

Subject area: Information and Communication Technology

Huda AlShuhayeb ¹, Behrouz Minaei ², Mohammad Ebrahim Shenassa ³, Sayyed Ali Hossayni ⁴

1 - PhD student
2 - .
3 - .
4 - Postdoctoral researcher

Submitted: Wednesday, 01 Rabi' al-Thani 1444. Accepted: Wednesday, 09 Sha'ban 1444. Published: Tuesday, 04 Rabi' al-Awwal 1445.

Keywords: morphology, Arabic language, annotation, dataset, morphological tagging

Abstract:

Arabic has a very rich and complex morphology. Morphological analysis is therefore valuable for processing Arabic, especially traditional texts such as historical and religious works, and it helps in understanding the meaning of those texts. For a morphological dataset, greater tag diversity and a larger number of samples make it more useful for evaluating morphological analysis methods. In this study we present a morphological dataset of about 223,690 words from the book Sharā'i' al-Islām, annotated by experts. In both size and tag diversity, it surpasses the other datasets that have been released for Arabic morphological analysis. To evaluate the dataset, we ran the Farasa system on the texts and report the quality of the annotation against Farasa's output using four metrics.
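The abstract does not name the four metrics. As a rough illustration only, the sketch below assumes they are word-level accuracy plus boundary-level precision, recall, and F1, computed by comparing gold segmentations with a system's output. The function names, the `+` separator, and the scoring scheme are assumptions made for this example, not the authors' procedure, and no Farasa API is called.

```python
from typing import List, Set, Tuple


def boundaries(segmented: str, sep: str = "+") -> Set[int]:
    """Return the character offsets (in the unsegmented word) where a cut occurs."""
    cuts: Set[int] = set()
    offset = 0
    parts = segmented.split(sep)
    for part in parts[:-1]:  # the last segment has no cut after it
        offset += len(part)
        cuts.add(offset)
    return cuts


def evaluate(gold: List[str], pred: List[str]) -> Tuple[float, float, float, float]:
    """Compare aligned gold and predicted segmentations of the same words.

    Returns (word accuracy, boundary precision, boundary recall, boundary F1).
    Assumption: gold[i] and pred[i] segment the same surface word.
    """
    if len(gold) != len(pred):
        raise ValueError("gold and pred must be aligned word by word")
    tp = fp = fn = exact = 0
    for g, p in zip(gold, pred):
        gb, pb = boundaries(g), boundaries(p)
        tp += len(gb & pb)   # cuts found in both
        fp += len(pb - gb)   # cuts only the system proposed
        fn += len(gb - pb)   # gold cuts the system missed
        exact += int(gb == pb)
    accuracy = exact / len(gold) if gold else 0.0
    precision = tp / (tp + fp) if tp + fp else 0.0
    recall = tp / (tp + fn) if tp + fn else 0.0
    denom = precision + recall
    f1 = 2 * precision * recall / denom if denom else 0.0
    return accuracy, precision, recall, f1
```

Calling `evaluate(gold, pred)` on two aligned lists of segmented words, for example hypothetical transliterated forms such as `"w+ktb+h"`, returns the four scores as a tuple. Scoring cut positions rather than segment strings keeps the comparison independent of how each tool spells its segments.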

