A Model for Textual Information Retrieval Using Interval Numbers
Authors: Hooman Tahayori 1, Farzad Ghahramani 2
1 - Associate Professor
2 - PhD student, Shiraz University
Keywords: textual information retrieval, document ranking, term weighting, interval numbers, interval weight
Abstract:
With the growth of the web and the increase in online content, the importance of information retrieval systems that can respond more accurately to users' information needs is clearer than ever. An important part of designing any information retrieval system is choosing a suitable way to model it, and here the choice of a term weighting method, which expresses the importance of terms in documents and queries, plays a significant role. Various term weighting methods have been proposed, most of which assign a single numeric weight, but it cannot be said with certainty which method is best. Given this ambiguity and uncertainty, this paper presents a model that, instead of using a single weight value, uses the weights obtained from a number of carefully selected base weighting methods to compute, for each term, a range of weights as an interval weight. In this model, with suitable aggregation, the relevance of each document to the input query is also determined as an interval weight, and the documents can then be ranked on that basis using one of three proposed methods. In experiments on the standard Cranfield and Medline datasets, the effects of length normalization of the base weight vectors and of using different components in the term frequency factor and the collection frequency factor were studied and discussed. The experiments showed that selecting an appropriate set of base weighting methods for the proposed approach, together with an appropriate ranking method, has a significant effect on improving system performance. With appropriate choices, MAP values of 0.43323 and 0.54580 were obtained for the two datasets, respectively. These results showed that the proposed method not only improves on each of the base weighting methods but also performs better than several recent, more complex weighting methods.
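The interval-weighting idea can be illustrated with a minimal Python sketch. Everything specific in it is an assumption made for illustration, since the abstract does not name them: the three base schemes (raw, log-scaled, and augmented term frequency, each multiplied by IDF), the use of the minimum and maximum base weight as interval bounds, the sum-of-products aggregation, and ranking by interval midpoint. The function names are hypothetical and are not the authors' implementation.

```python
import math
from collections import Counter


def base_weights(term, tokens, corpus):
    """Crisp weights of `term` in `tokens` under three assumed base schemes."""
    counts = Counter(tokens)
    tf = counts[term]
    df = sum(1 for doc in corpus if term in doc)
    if tf == 0 or df == 0:
        return [0.0]
    idf = math.log(len(corpus) / df)
    max_tf = max(counts.values())
    return [
        tf * idf,                           # raw tf x idf
        (1 + math.log(tf)) * idf,           # log-scaled tf x idf
        (0.5 + 0.5 * tf / max_tf) * idf,    # augmented tf x idf
    ]


def interval_weight(term, tokens, corpus):
    """Interval weight: (lowest, highest) of the base weights (assumed bounds)."""
    weights = base_weights(term, tokens, corpus)
    return (min(weights), max(weights))


def interval_relevance(query, doc, corpus):
    """Aggregate query and document intervals into one relevance interval.

    All weights here are non-negative, so the product of two intervals is
    simply (lo * lo, hi * hi), and the sum adds the bounds separately.
    """
    lo = hi = 0.0
    for term in set(query):
        q_lo, q_hi = interval_weight(term, query, corpus)
        d_lo, d_hi = interval_weight(term, doc, corpus)
        lo += q_lo * d_lo
        hi += q_hi * d_hi
    return (lo, hi)


def rank_by_midpoint(query, corpus):
    """Rank document indices by interval midpoint (one possible ranking rule)."""
    scores = [interval_relevance(query, doc, corpus) for doc in corpus]
    return sorted(range(len(corpus)),
                  key=lambda i: -(scores[i][0] + scores[i][1]) / 2)
```

Ranking by midpoint is only one way to order intervals; other rules, such as comparing lower or upper bounds, would give different rankings when intervals overlap.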
The recent expansion of the web demands more capable information retrieval systems that address users' information needs more accurately. Weighting the terms in documents plays an important role in any information retrieval system. Various term weighting methods have been proposed; however, it is not straightforward to assert which one is more effective than the others. In this paper, we propose a method that calculates the weights of the terms in documents and queries as interval numbers. The interval numbers are derived by aggregating the crisp weights calculated by existing weighting methods. The proposed method calculates an interval number as the overall relevance of each document to the given query. We discuss three approaches for ranking the interval relevance numbers. In experiments on the Cranfield and Medline datasets, we study the effects of weight normalization and of using variations of term and document frequency, and show that an appropriate selection of basic term weighting methods, in conjunction with their aggregation into an interval number, considerably improves information retrieval performance. Through appropriate selection of basic weighting methods, we reach MAP values of 0.43323 and 0.54580 on the two datasets, respectively. The results show that the proposed method outperforms any single basic weighting method as well as other existing, more complicated weighting methods.
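For readers unfamiliar with the reported metric, the following short Python sketch shows how mean average precision (MAP) is conventionally computed from ranked result lists and relevance judgments. The function names and the input layout (a list of `(ranked_ids, relevant_ids)` pairs, one per query) are assumptions for illustration; the exact evaluation tooling used in the paper is not stated in the abstract.

```python
def average_precision(ranked_ids, relevant_ids):
    """Average of precision values at the ranks where relevant documents occur."""
    relevant = set(relevant_ids)
    if not relevant:
        return 0.0
    hits = 0
    total = 0.0
    for rank, doc_id in enumerate(ranked_ids, start=1):
        if doc_id in relevant:
            hits += 1
            total += hits / rank  # precision at this rank
    # Relevant documents that were never retrieved contribute zero.
    return total / len(relevant)


def mean_average_precision(runs):
    """Mean of per-query average precision over (ranked_ids, relevant_ids) pairs."""
    if not runs:
        return 0.0
    return sum(average_precision(r, rel) for r, rel in runs) / len(runs)
```

A MAP of 1.0 would mean every query ranks all of its relevant documents ahead of every non-relevant one.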