A Semantic Approach to Person Profile Extraction from Farsi Web Documents
Subject areas: Semantic Web
Hojjat Emami 1, Hossein Shirazi 2, Ahmad Abdolahzade 3
1 - Malek-Ashtar University of Technology
2 - Malek-Ashtar University of Technology
3 - Amir Kabir University
Keywords: Web mining, information extraction, entity profiling, Farsi language
Abstract:
Entity profiling (EP), an important task in Web mining and information extraction (IE), is the process of extracting target entities and their related information from given text resources. From a computational viewpoint, Farsi is one of the less-studied and less-resourced languages, and it suffers from a lack of high-quality language processing tools, which underscores the need to develop Farsi text processing systems. As a contribution to EP research, we present a semantic approach for extracting profiles of person entities from Farsi Web documents. Our approach comprises three major components: (i) pre-processing, (ii) semantic analysis, and (iii) attribute extraction. First, the system takes raw text as input and annotates it using existing pre-processing tools. In the semantic analysis stage, we analyze the pre-processed text syntactically and semantically, and enrich the locally processed information with semantic information obtained from a distant knowledge base. We then use a semantic rule-based approach to extract the information related to the persons in question. We demonstrate the effectiveness of our approach by testing it on a small Farsi corpus. The experimental results are encouraging and show that the proposed method outperforms baseline methods.