Detecting Spam Pages Using the XGBoost Algorithm

Subject areas: Electrical and Computer Engineering

Reyhaneh Rashidpour ¹, Ali Mohammad Zareh Bidoki ²

1 - Faculty of Computer Engineering, Yazd University, Yazd, Iran
2 - Faculty of Computer Engineering, Yazd University, Yazd, Iran

Received: 1403/03/29; Accepted: 1403/06/10; Published: 1404/03/13 (Solar Hijri dates)

Keywords: web spam, XGBoost classification algorithm, data balancing, machine learning

Abstract:

Search engines are today the gateway to the web. As the web has grown in popularity, attempts to exploit it for commercial, social, and political purposes have grown as well, making it harder for search engines to distinguish good content from spam. The concept of web spam was first introduced in 1996 and was soon recognized as one of the key challenges for the search engine industry. Spam arises primarily because a significant share of visits to a web page come from search engines, and users tend to look at the first few search results. The goal of spam detection is to prevent such pages from attaining high rankings through deceptive strategies. This work aims to provide an effective method for identifying spam pages and thereby reduce the presence of spam among the top search results. Two methods for combating web spam are proposed. The first, XGspam, identifies spam pages using the XGBoost learning algorithm with an accuracy of 94.27%. The second, XGSspam, addresses the imbalance of web data by combining the SMOTE oversampling algorithm with the XGBoost classification model, and reaches an accuracy of 95.44% in identifying spam pages.
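The balancing step behind the second method can be illustrated with a minimal sketch of SMOTE-style oversampling. This is not the authors' implementation: the function names `smote` and `balance`, the two-feature toy data, and the parameter defaults are all assumptions made for illustration, and base points are drawn at random rather than iterated in order as in the original SMOTE formulation. Each synthetic spam sample is placed on the line segment between a minority-class point and one of its k nearest minority-class neighbours.

```python
import math
import random


def smote(minority, n_new, k=5, seed=0):
    """Create n_new synthetic points by interpolating between minority-class neighbours."""
    if len(minority) < 2:
        raise ValueError("SMOTE needs at least two minority samples")
    rng = random.Random(seed)
    k = min(k, len(minority) - 1)
    synthetic = []
    for _ in range(n_new):
        i = rng.randrange(len(minority))
        base = minority[i]
        # Indices of the k nearest minority-class neighbours of the base point.
        neighbours = sorted(
            (j for j in range(len(minority)) if j != i),
            key=lambda j: math.dist(base, minority[j]),
        )[:k]
        other = minority[rng.choice(neighbours)]
        gap = rng.random()  # position along the segment from base to other
        synthetic.append(tuple(b + gap * (o - b) for b, o in zip(base, other)))
    return synthetic


def balance(samples, labels, minority_label=1, k=5, seed=0):
    """Oversample the minority class until both classes have the same size."""
    minority = [x for x, y in zip(samples, labels) if y == minority_label]
    n_new = max(0, (len(labels) - len(minority)) - len(minority))
    new_points = smote(minority, n_new, k=k, seed=seed)
    return samples + new_points, labels + [minority_label] * len(new_points)


# Hypothetical toy features: 8 normal pages (label 0) and 3 spam pages (label 1).
normal = [(0.10, 0.20), (0.15, 0.25), (0.20, 0.10), (0.05, 0.30),
          (0.25, 0.20), (0.30, 0.15), (0.12, 0.18), (0.22, 0.28)]
spam = [(0.80, 0.90), (0.85, 0.70), (0.90, 0.95)]
X = normal + spam
y = [0] * len(normal) + [1] * len(spam)

X_bal, y_bal = balance(X, y)
print(y_bal.count(0), y_bal.count(1))  # both classes now have 8 samples
```

In a pipeline like the one the abstract describes, the balanced training set would then be passed to an XGBoost classifier; that step is omitted here. Oversampling is normally applied to the training split only, so that evaluation reflects the original class distribution.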

