Web Robot Detection Using Fuzzy Rough Set Theory
Subject areas: Electrical and Computer Engineering
                                                    
Samaneh Rahimi 1, Javad Hamidzadeh 2 *

1 - Imam Reza International University
2 - Sadjad University of Technology

Abstract:
Web robots are software programs that traverse the Internet autonomously; their main task is to fetch information and send it back to the originating server. Because they consume substantial network bandwidth and degrade server performance, detecting them has become an important problem. This paper applies fuzzy rough set theory to web robot detection. The proposed method has four phases. In the first phase, web user sessions are identified using fuzzy rough set clustering. In the second phase, a vector of 10 distinguishing features is extracted for each session. In the third phase, the identified sessions are labeled using a heuristic method. In the fourth phase, these labels are refined using fuzzy rough set classification. The method is evaluated on a real-world dataset. Experimental results, compared against state-of-the-art methods, show that the proposed method achieves a higher F-measure.
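The fourth phase and the evaluation metric can be illustrated with a minimal Python sketch. The abstract does not state which similarity relation, implicator, or t-norm the method uses, so everything below is an assumption chosen for illustration: a range-normalised similarity, the Łukasiewicz implicator for the lower approximation, and the minimum t-norm for the upper approximation. The function names (`similarity`, `lower_upper`, `classify`, `f_measure`) and the two-feature toy sessions are hypothetical and are not taken from the paper, which uses 10 features per session.

```python
from typing import List, Sequence, Tuple


def similarity(x: Sequence[float], y: Sequence[float],
               ranges: Sequence[float]) -> float:
    """Assumed fuzzy similarity: mean of 1 - |x_i - y_i| / range_i."""
    total = 0.0
    for xi, yi, r in zip(x, y, ranges):
        total += 1.0 - (abs(xi - yi) / r if r > 0 else 0.0)
    return total / len(x)


def lower_upper(x: Sequence[float], train: List[Sequence[float]],
                labels: List[str], target: str,
                ranges: Sequence[float]) -> Tuple[float, float]:
    """Membership of x in the fuzzy rough lower/upper approximation of `target`."""
    lower, upper = 1.0, 0.0
    for y, lab in zip(train, labels):
        r = similarity(x, y, ranges)
        a = 1.0 if lab == target else 0.0
        # Lower approximation: infimum of the Lukasiewicz implicator I(r, a).
        lower = min(lower, min(1.0, 1.0 - r + a))
        # Upper approximation: supremum of the minimum t-norm T(r, a).
        upper = max(upper, min(r, a))
    return lower, upper


def classify(x: Sequence[float], train: List[Sequence[float]],
             labels: List[str], ranges: Sequence[float]) -> str:
    """Pick the class with the largest lower + upper approximation membership."""
    classes = sorted(set(labels))
    return max(classes, key=lambda c: sum(lower_upper(x, train, labels, c, ranges)))


def f_measure(y_true: List[str], y_pred: List[str], positive: str = "robot") -> float:
    """F-measure: harmonic mean of precision and recall for the positive class."""
    tp = sum(1 for t, p in zip(y_true, y_pred) if t == positive and p == positive)
    fp = sum(1 for t, p in zip(y_true, y_pred) if t != positive and p == positive)
    fn = sum(1 for t, p in zip(y_true, y_pred) if t == positive and p != positive)
    precision = tp / (tp + fp) if tp + fp else 0.0
    recall = tp / (tp + fn) if tp + fn else 0.0
    if precision + recall == 0.0:
        return 0.0
    return 2.0 * precision * recall / (precision + recall)


# Toy sessions with two made-up features, both scaled to [0, 1].
train_sessions = [[0.9, 0.1], [0.8, 0.2], [0.1, 0.9], [0.2, 0.8]]
train_labels = ["robot", "robot", "human", "human"]
feature_ranges = [1.0, 1.0]

predicted = classify([0.85, 0.15], train_sessions, train_labels, feature_ranges)
```

In this toy example the query session lies close to the two robot sessions, so its upper-approximation membership in the robot class is high and `classify` returns `"robot"`. Other implicators, t-norms, or decision rules are equally valid instances of fuzzy rough classification and may be closer to what the authors actually used.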

