SGF (Semantic Graphs Fusion): A Knowledge-based Representation of Textual Resources for Text Mining Applications
Subject areas: Natural Language Processing
Morteza Jaderyan 1, Hassan Khotanlou 2
1 - Bu Ali Sina University
2 - Bu Ali Sina University
Keywords: Semantic document representation, Ontology, Knowledge base (KB), Semantic network, Information fusion
Abstract:
The proper representation of textual documents has been the greatest challenge in text mining applications. In this paper, a knowledge-based representation model for text analysis applications is introduced. The system achieves its functionality by integrating structured knowledge into its core components. The pre-processing module identifies the semantic, lexical, syntactical and structural features of a document. The enrichment module identifies contextually similar concepts and concept maps to improve the representation. The information content of the documents and the enriched content are then fused (merged) into the graph structure of a semantic network to form a unified and comprehensive representation of the documents. The 20 Newsgroups and Reuters-21578 datasets are used for evaluation. The evaluation results suggest that the proposed method achieves high accuracy, recall and precision. The results also indicate that the proposed method performs well in standard text mining applications even when only a small portion of the information content is available.
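The pipeline described in the abstract (pre-processing, enrichment, fusion into a semantic network) can be illustrated with a minimal Python sketch. This is not the authors' implementation: the toy knowledge base `TOY_KB`, the function names `extract_concepts`, `enrich` and `fuse`, and the edge labels `co-occurs` and `kb-related` are all hypothetical and chosen only for illustration. The abstract does not specify the actual features, knowledge base or fusion rules.

```python
from collections import defaultdict

# Hypothetical toy knowledge base: concept -> contextually similar concepts.
# The knowledge base used in the paper is not specified in the abstract.
TOY_KB = {
    "car": ["vehicle", "engine"],
    "engine": ["motor"],
}


def extract_concepts(text, vocabulary):
    """Stand-in for the pre-processing module: keep known concepts, in order."""
    tokens = [t.strip(".,;:!?").lower() for t in text.split()]
    return [t for t in tokens if t in vocabulary]


def enrich(concepts, kb):
    """Stand-in for the enrichment module: look up related concepts in the KB."""
    return {c: list(kb.get(c, [])) for c in concepts}


def fuse(concepts, enriched):
    """Merge document concepts and enriched concepts into one labelled graph."""
    graph = defaultdict(dict)  # node -> {neighbour: edge label}

    # Link concepts that follow each other in the extracted concept sequence.
    for left, right in zip(concepts, concepts[1:]):
        if left != right:
            graph[left][right] = "co-occurs"
            graph[right][left] = "co-occurs"

    # Attach KB-derived concepts without overwriting document evidence.
    for concept, related in enriched.items():
        graph[concept]  # make sure isolated concepts still become nodes
        for other in related:
            graph[concept].setdefault(other, "kb-related")
            graph[other].setdefault(concept, "kb-related")

    return dict(graph)


if __name__ == "__main__":
    doc_concepts = extract_concepts("The car has a new engine.", set(TOY_KB))
    network = fuse(doc_concepts, enrich(doc_concepts, TOY_KB))
    for node, edges in network.items():
        print(node, edges)
```

In this sketch, an edge observed in the document takes priority over an edge suggested by the knowledge base. That is one possible design choice among many; a real system would more likely weight and combine both sources of evidence.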