Farsi Conceptual Text Summarizer: A New Model in Continuous Vector Space
Subject area: Natural Language Processing
Mohammad Ebrahim Khademi 1, Mohammad Fakhredanesh 2, Seyed Mojtaba Hoseini 3
1 - Malek Ashtar University of Technology
2 - Malek Ashtar University of Technology
3 - Malek Ashtar University of Technology
Keywords: Extractive Text Summarization, Unsupervised Learning, Language-Independent Summarization
Abstract:
Traditional, manual methods of summarization were costly and time-consuming, which led to the emergence of automatic methods for text summarization. Extractive summarization is an automatic method that generates a summary by identifying the most important sentences of a text. In this paper, two novel approaches are presented for summarizing Persian texts. In these methods, using a combination of deep learning and statistical methods, we cluster the concepts of the text and, based on the importance of the concepts in each sentence, extract the sentences that carry the greatest conceptual load. In the first, unsupervised method, without using any hand-crafted features, we achieved state-of-the-art results on the Pasokh single-document corpus compared to the best supervised Persian methods. To put these results in context, we also evaluated the human summaries produced by the contributing authors of the Pasokh corpus as a reference for the success rate of the proposed methods; in terms of recall, our methods achieved favorable results. In the second method, by introducing and increasing a title-effect coefficient, the average ROUGE-2 score increased by 0.4% on the Pasokh single-document corpus compared to the first method, and the average ROUGE-1 score increased by 3% on the Khabir news corpus.
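The pipeline described above (embed words in a continuous vector space, cluster them into concepts, then score sentences by the weight of the concepts they contain, optionally boosted by a title-effect coefficient) can be sketched as follows. This is a minimal illustration, not the paper's implementation: pretrained word embeddings are replaced here by deterministic per-word random vectors so the snippet is self-contained, the clustering is a tiny k-means, and concept importance is approximated by cluster frequency. The function names (`summarize`, `title_boost`, etc.) are hypothetical.

```python
import hashlib
import numpy as np

def embed(word, dim=16):
    # Stand-in for learned continuous word vectors: a deterministic
    # pseudo-random vector seeded by a stable hash of the word.
    seed = int.from_bytes(hashlib.md5(word.encode()).digest()[:4], "little")
    return np.random.default_rng(seed).standard_normal(dim)

def kmeans(X, k, iters=20, seed=0):
    # Plain Lloyd's k-means over the word vectors; returns cluster labels.
    rng = np.random.default_rng(seed)
    centers = X[rng.choice(len(X), k, replace=False)]
    for _ in range(iters):
        dists = np.linalg.norm(X[:, None] - centers[None], axis=2)
        labels = dists.argmin(axis=1)
        for j in range(k):
            if (labels == j).any():
                centers[j] = X[labels == j].mean(axis=0)
    return labels

def summarize(sentences, title="", k=3, n=2, title_boost=1.5):
    # 1) Cluster the vocabulary into k "concepts".
    words = sorted({w for s in sentences for w in s.lower().split()})
    k = min(k, len(words))
    labels = kmeans(np.stack([embed(w) for w in words]), k)
    concept = dict(zip(words, labels))
    # 2) Weight each concept by how often it occurs in the document.
    weight = np.bincount(labels, minlength=k).astype(float)
    title_words = set(title.lower().split())
    # 3) Score each sentence by the average weight of its concepts,
    #    boosting words shared with the title (title-effect coefficient).
    scores = []
    for s in sentences:
        toks = s.lower().split()
        score = sum(weight[concept[w]] * (title_boost if w in title_words else 1.0)
                    for w in toks) / max(len(toks), 1)
        scores.append(score)
    # 4) Extract the n best-scoring sentences, kept in document order.
    top = sorted(np.argsort(scores)[-n:])
    return [sentences[i] for i in top]
```

Raising `title_boost` above 1.0 mimics the second method's increased title-effect coefficient: sentences sharing vocabulary with the title climb in the ranking.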
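The ROUGE-1 and ROUGE-2 figures reported in the abstract measure n-gram recall between a system summary and human reference summaries. A minimal sketch of ROUGE-N recall (standard toolkits add stemming, stopword handling, and multi-reference aggregation on top of this core computation):

```python
from collections import Counter

def rouge_n_recall(candidate, reference, n=2):
    # ROUGE-N recall: the fraction of the reference's n-grams
    # that also appear in the candidate summary (clipped counts).
    def ngrams(text):
        toks = text.lower().split()
        return Counter(tuple(toks[i:i + n]) for i in range(len(toks) - n + 1))
    cand, ref = ngrams(candidate), ngrams(reference)
    overlap = sum(min(count, ref[gram]) for gram, count in cand.items())
    total = sum(ref.values())
    return overlap / total if total else 0.0
```

With `n=1` this gives ROUGE-1 and with `n=2` ROUGE-2, the two metrics the paper reports.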