Feature Selection for Author Identification in Short Online Persian Texts


Somayeh Arefi ¹, Mohammad Ehsan Basiri ², Omid Roozmand ³

1 - Student
2 - Instructor
3 - Instructor

Submitted: Saturday, 25 Dhu al-Hijjah 1441. Accepted: Thursday, 29 Jumada al-Thani 1442. Published: Saturday, 27 Muharram 1443.

Keywords: text analysis, style analysis, feature extraction, feature selection, author identification.

Abstract:

The growing use of social media and online communication for expressing opinions and exchanging ideas, together with the spread of these tools among Persian-speaking users, has increased the volume of Persian text on the web. This marked growth, combined with the abuses made possible by author anonymity, makes the need for an automatic author identification system for Persian increasingly evident. This study investigates which features are effective for identifying the authors of Persian reviews written by mobile phone buyers, and it evaluates both supervised and unsupervised methods. The feature groups examined are lexical, orthographic, semantic, structural, syntactic, text-specific, and social-network-specific features. After these features are extracted, the top features are selected with four algorithms: feature correlation, gain ratio, OneR, and principal component analysis. K-means, EM, and density-based clustering are then used for clustering, and Bayesian network, random forest, and Bagging are used for classification. Evaluation on Persian reviews written by buyers of Samsung phones shows that the best clustering result, with an accuracy of 59.16%, is obtained by EM on the 15 top features selected by OneR, whereas random forest combined with gain ratio on 90 features gives the best classification performance, with an accuracy of 79.57%. A comparison of the feature groups also shows that orthographic features have the greatest effect on identifying the authors of short texts, followed in order by lexical, text-specific, social-network-specific, structural, syntactic, and semantic features.
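As a rough illustration of the extract-then-select step described in the abstract, the following Python sketch computes a few stylometric features per comment and ranks them by gain ratio against the author label. It is a minimal sketch under stated assumptions, not the authors' implementation: the feature names, the `PUNCTUATION` set, the median-split discretization, and the helper names `extract_features`, `gain_ratio`, and `rank_features` are all hypothetical and do not come from the paper.

```python
import math
from collections import Counter

# Assumption: a small illustrative set of Latin and Persian punctuation marks.
PUNCTUATION = ".,!?;:،؛؟"


def extract_features(text):
    """Map one comment to a few illustrative (hypothetical) stylometric features."""
    words = text.split()
    n_chars = max(len(text), 1)
    n_words = max(len(words), 1)
    return {
        "punct_ratio": sum(ch in PUNCTUATION for ch in text) / n_chars,
        "avg_word_len": sum(len(w) for w in words) / n_words,
        "type_token_ratio": len(set(words)) / n_words,
        "word_count": float(len(words)),
    }


def entropy(labels):
    """Shannon entropy (in bits) of a list of discrete labels."""
    total = len(labels)
    return -sum((c / total) * math.log2(c / total) for c in Counter(labels).values())


def gain_ratio(values, labels):
    """Gain ratio of a numeric feature after a binary split at its median.

    Assumption: the median split is a simplification chosen for this sketch.
    """
    threshold = sorted(values)[len(values) // 2]
    bins = [v >= threshold for v in values]
    split_info = entropy(bins)
    if split_info == 0:  # every sample fell into one bin: the feature is uninformative
        return 0.0
    conditional = 0.0
    for b in set(bins):
        subset = [lab for lab, x in zip(labels, bins) if x == b]
        conditional += len(subset) / len(labels) * entropy(subset)
    return (entropy(labels) - conditional) / split_info


def rank_features(texts, authors):
    """Return (feature name, gain ratio) pairs sorted from most to least informative."""
    rows = [extract_features(t) for t in texts]
    scores = {name: gain_ratio([r[name] for r in rows], authors) for name in rows[0]}
    return sorted(scores.items(), key=lambda kv: kv[1], reverse=True)
```

Calling `rank_features(texts, authors)` on a list of comments and a parallel list of author labels returns the features ordered by score, and keeping the first k entries corresponds to selecting the top k features before clustering or classification.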

