Word Sense Induction in Persian and English: A Comparative Study
Subject areas: Natural Language Processing
1 - Institute for Humanities and Cultural Studies
Keywords: Corpus Linguistics, Word Sense Induction, Parametric Clustering, Non-Parametric Clustering, Unsupervised Machine Learning
Abstract:
Words in natural language have forms and meanings, and the two do not always match one-to-one. As a result, a word can have more than one meaning, and a text processing system struggles to determine the precise meaning of a target word in a sentence. Lexical resources and databases such as WordNet can help, but because they are developed manually, they become outdated as time passes and the language changes. Moreover, lexical resources may be domain dependent, which makes them unusable for open-domain natural language processing tasks. These drawbacks strongly motivate the use of unsupervised machine learning approaches to induce word senses from natural data. To this end, clustering can be used such that each cluster represents a sense. In this paper, we study the performance of a word sense induction model along three variables: a) the target language: we run the induction process on Persian and English; b) the type of clustering algorithm: both parametric clustering algorithms, including hierarchical and partitioning, and non-parametric clustering algorithms, including probabilistic and density-based, are used to induce senses; c) the context of the target word used to build the vectors for clustering: the input vectors are created either from the whole sentence in which the target word occurs or from a limited window of words surrounding the target word. We evaluate the clustering performance with external evaluation measures. Moreover, we introduce a normalized, joint evaluation metric to compare the models. The experimental results on both the Persian and English test data show that the window-based partitioning K-means algorithm achieves the best performance.
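The difference between sentence-based and window-based context vectors, and their use as input to K-means, can be illustrated with a minimal sketch. It is not the implementation evaluated in the paper. The toy English sentences, the bag-of-words count vectors, the window size of 3, the seeding of centroids with the first k vectors, and all function names are assumptions made for illustration only.

```python
from collections import Counter


def context_words(tokens, target_index, window=None):
    """Context of the target word: the whole sentence when window is None,
    otherwise at most `window` words on each side of the target."""
    if window is None:
        return tokens[:target_index] + tokens[target_index + 1:]
    left = tokens[max(0, target_index - window):target_index]
    right = tokens[target_index + 1:target_index + 1 + window]
    return left + right


def count_vectors(contexts):
    """Bag-of-words count vectors over a shared, sorted vocabulary."""
    vocabulary = sorted({word for context in contexts for word in context})
    vectors = []
    for context in contexts:
        counts = Counter(context)
        vectors.append([float(counts[word]) for word in vocabulary])
    return vectors


def squared_distance(a, b):
    return sum((x - y) ** 2 for x, y in zip(a, b))


def kmeans(vectors, k, max_iterations=50):
    """Plain K-means; the first k vectors seed the centroids (toy choice)."""
    centroids = [list(vector) for vector in vectors[:k]]
    labels = None
    for _ in range(max_iterations):
        # Assignment step: each vector goes to its nearest centroid.
        new_labels = [
            min(range(k), key=lambda c: squared_distance(vector, centroids[c]))
            for vector in vectors
        ]
        if new_labels == labels:
            break  # assignments are stable, so the algorithm has converged
        labels = new_labels
        # Update step: each centroid becomes the mean of its members.
        for c in range(k):
            members = [v for v, label in zip(vectors, labels) if label == c]
            if members:
                centroids[c] = [sum(col) / len(members) for col in zip(*members)]
    return labels


def induce_senses(instances, k, window=None):
    """Cluster the occurrences of a target word; each cluster stands for a sense."""
    contexts = [context_words(tokens, index, window) for tokens, index in instances]
    return kmeans(count_vectors(contexts), k)


# Hypothetical occurrences of the target word "bank": (tokens, index of target).
instances = [
    ("she deposited money at the bank yesterday".split(), 5),
    ("they sat on the river bank near the water".split(), 5),
    ("the bank approved the money transfer".split(), 1),
    ("the river bank flooded with water".split(), 2),
]

window_labels = induce_senses(instances, k=2, window=3)  # window-based context
sentence_labels = induce_senses(instances, k=2)          # sentence-based context
print(window_labels, sentence_labels)  # [0, 1, 0, 1] [0, 1, 0, 1]
```

On this toy input both context settings separate the financial occurrences from the river occurrences. With real data the two settings can yield different clusterings, which is the comparison the study reports.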