Recognizing Transliterated English Words in Persian Texts
Authors: Ali Hoseinmardy 1, Saeedeh Momtazi 2
1 - Amirkabir University
2 - Amirkabir University
Keywords: Transliteration, Text processing, Words Relation, Neural Network-Based Sequence2Sequence Model, Google Translate, Behnevis
Abstract:
One of the most important problems in text processing systems is word mismatch. In information retrieval it limits access to the required information; in the analysis of textual data such as news, it lowers the accuracy of text classification and clustering. If the text-processing engine does not treat similar or related words as sharing a sense, it may fail to return the appropriate result. Various statistical techniques have been proposed to bridge this vocabulary gap; for example, if two words frequently occur in similar contexts, they are assumed to have similar or related meanings. Synonyms and similar words, however, are only one category of related words that statistical approaches are expected to capture. Another category is the pair formed by an original word in one language and the transliteration of its counterpart from another language. Such pairs are common in non-English languages: instead of using the original word of the target language, the writer borrows the English word and simply transliterates it into the target language. Because this writing style appears in only a limited range of texts, transliterated words are less frequent than original words, and available corpus-based techniques therefore cannot capture their meaning. In this article, we propose two approaches to overcome this problem: (1) neural network-based transliteration, and (2) existing machine translation/transliteration tools such as Google Translate and Behnevis. Our experiments on a dataset built for this purpose show that combining the two approaches detects English words with 89.39% accuracy.
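The context-based idea mentioned in the abstract (words that occur in similar contexts tend to be related) can be illustrated with a minimal sketch. This is not the authors' method; it is a generic co-occurrence and cosine-similarity example. The function names `context_vectors` and `cosine`, the window size, and the toy corpus are all assumptions made for illustration, and the tokens `rayaneh` and `kampiuter` are romanized placeholders standing in for a native word and a transliterated loanword.

```python
from collections import Counter
from math import sqrt


def context_vectors(sentences, window=2):
    """For each word, count the words that appear within `window` positions of it."""
    vectors = {}
    for tokens in sentences:
        for i, word in enumerate(tokens):
            start = max(0, i - window)
            end = min(len(tokens), i + window + 1)
            counts = vectors.setdefault(word, Counter())
            for j in range(start, end):
                if j != i:
                    counts[tokens[j]] += 1
    return vectors


def cosine(a, b):
    """Cosine similarity between two sparse count vectors (Counter objects)."""
    dot = sum(a[key] * b[key] for key in a if key in b)
    norm_a = sqrt(sum(value * value for value in a.values()))
    norm_b = sqrt(sum(value * value for value in b.values()))
    return dot / (norm_a * norm_b) if norm_a and norm_b else 0.0


# Hypothetical toy corpus: "rayaneh" plays the native word and
# "kampiuter" the rarer transliterated loanword.
corpus = [
    ["i", "bought", "a", "new", "rayaneh", "yesterday"],
    ["my", "new", "rayaneh", "is", "fast"],
    ["she", "bought", "a", "new", "kampiuter", "yesterday"],
    ["the", "weather", "is", "cold", "today"],
]

vectors = context_vectors(corpus)
sim_related = cosine(vectors["rayaneh"], vectors["kampiuter"])
sim_unrelated = cosine(vectors["rayaneh"], vectors["weather"])
print(sim_related > sim_unrelated)
```

In this toy setting the loanword shares contexts with the native word, so the similarity is high. The limitation the abstract describes arises when the transliterated form occurs too rarely for its context vector to be reliable, which is why a purely corpus-based measure may miss such pairs.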