Ranking Improvement Using BERT
Subject Areas: electrical and computer engineering
Shekoofe Bostan 1, Ali-Mohammad Zare-Bidoki 2, Mohammad-Reza Pajoohan 3
1 - Yazd University
2 - Department of Computer Engineering, Yazd University, Iran
3 -
Keywords: word embedding, BERT, semantic vector, query, ranking
Abstract:
In today's information age, efficient document ranking plays a crucial role in information retrieval systems. This article proposes a new approach to document ranking based on embedding models, focusing on the BERT language model to improve ranking results. The approach uses word-embedding methods to build semantic representations of user queries and document content. By converting textual data into semantic vectors, the relationships and similarities between queries and documents can be evaluated by the proposed ranking formulas at lower cost. These formulas take several factors into account to improve accuracy, including word-embedding vectors, keyword position, and the influence of high-value words as reflected in the semantic vectors. Comparative experiments and analyses were conducted to evaluate the effectiveness of the proposed formulas. The empirical results show that the proposed approach achieves higher accuracy than common ranking methods, improving ranking accuracy to 0.87 in the best case. This study thus contributes to better document ranking and demonstrates the potential of the BERT embedding model for improving ranking performance.
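The core idea described above — embedding the query and each document into semantic vectors, then ranking documents by vector similarity — can be illustrated with a minimal sketch. The vectors below are hand-written toy stand-ins for real BERT embeddings, and plain cosine similarity stands in for the article's proposed ranking formulas; the document ids are hypothetical.

```python
import math

def cosine(a, b):
    """Cosine similarity between two equal-length vectors."""
    dot = sum(x * y for x, y in zip(a, b))
    norm = math.sqrt(sum(x * x for x in a)) * math.sqrt(sum(y * y for y in b))
    return dot / norm

def rank(query_vec, doc_vecs):
    """Return document ids sorted by descending similarity to the query."""
    return sorted(doc_vecs, key=lambda d: cosine(query_vec, doc_vecs[d]), reverse=True)

# Toy 3-dimensional vectors standing in for BERT embedding vectors.
query = [1.0, 0.0, 1.0]
docs = {
    "d1": [0.9, 0.1, 1.0],    # semantically close to the query
    "d2": [0.0, 1.0, 0.0],    # orthogonal, i.e. unrelated content
    "d3": [-1.0, 0.0, -1.0],  # pointing in the opposite direction
}
print(rank(query, docs))  # ['d1', 'd2', 'd3']
```

In the article's setting the vectors would come from a BERT model (e.g. a Persian BERT such as ParsBERT, per reference [24]) and the similarity score would additionally weight keyword position and high-value words.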