Phrase Segmentation on Persian Texts Using Neural Networks
Subject Areas : electrical and computer engineering
M. M. Mirdamadi 1 , A. M. Zareh Bidoki 2 , M. Rezaeian 3
1 - Yazd University
2 - Yazd University
3 - Yazd University
Keywords: Segmentation, natural language processing, search engine, neural networks
Abstract :
Word and phrase segmentation is one of the main tasks in natural language processing (NLP). Many NLP applications require a preprocessing step that extracts the words of a text and identifies its phrases. The ultimate goal of segmentation is to obtain meaningful words together with their prefixes and suffixes. Depending on the language, this task can be easy or hard. Persian is among the languages whose preprocessing is complex. One source of this complexity is the need to handle different writing styles. Written Persian uses two kinds of space: the short space and the white space. There are also various conventions for writing Persian texts, which differ in how words are written, whether spaces are used or omitted within or between words, which character forms are used, and so on. In this paper, we propose a statistical method for phrase segmentation of Persian texts using neural networks, intended for use in search engines. For this purpose, we use the occurrence likelihood of uniwords and biwords in a corpus. The proposed algorithm consists of four steps and detects about 89.6% of tokens correctly. Experimental results show that this method can improve on the performance of the usual methods.
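To make the notion of uniword and biword occurrence likelihood concrete, the following minimal Python sketch computes relative-frequency features for the gap between two adjacent tokens. It is an illustration under stated assumptions, not the four-step algorithm or the neural network of the paper: the function names `count_ngrams` and `boundary_features`, the choice of four features, and the toy English corpus (used only for readability) are all hypothetical. In a setup like the one described, such features could serve as inputs to a classifier that decides whether two tokens belong to the same phrase.

```python
from collections import Counter


def count_ngrams(sentences):
    """Count uniwords and biwords over a list of tokenized sentences."""
    uni, bi = Counter(), Counter()
    for tokens in sentences:
        uni.update(tokens)
        # Adjacent token pairs within the same sentence.
        bi.update(zip(tokens, tokens[1:]))
    return uni, bi


def boundary_features(left, right, uni, bi):
    """Relative-frequency features for the gap between two adjacent tokens.

    Hypothetical feature set: P(left), P(right), P(left right), P(right | left).
    """
    total_uni = sum(uni.values()) or 1
    total_bi = sum(bi.values()) or 1
    p_left = uni[left] / total_uni
    p_right = uni[right] / total_uni
    p_pair = bi[(left, right)] / total_bi
    # How often `left` is immediately followed by `right`.
    p_cond = bi[(left, right)] / uni[left] if uni[left] else 0.0
    return [p_left, p_right, p_pair, p_cond]


# Toy corpus (assumption: already split on white space).
corpus = [
    ["neural", "network", "model"],
    ["neural", "network", "training"],
    ["deep", "model"],
]
uni, bi = count_ngrams(corpus)
features = boundary_features("neural", "network", uni, bi)
```

A pair that always occurs together, such as "neural network" in this toy corpus, receives a high conditional likelihood, while a pair that never occurs receives zero; a trained model would learn how to weigh such signals.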