An Effective Method of Feature Selection in Persian Text for Improving the Accuracy of Detecting Request in Persian Messages on Telegram
Subject areas: Machine learning
Zahra Khalifeh Zadeh 1, Mohammad Ali Zare Chahooki 2
1 - Yazd University
2 - Yazd University
Keywords: Feature Selection, Text Mining, Classification Accuracy, Machine Learning, Ensemble Classifier
Abstract:
In recent years, the volume of data generated on social media has grown exponentially, making these platforms a valuable source of information for analysts and for businesses seeking to expand. Automatic document classification is an essential step in extracting knowledge from such sources. In automatic text classification, words are treated as features; selecting the useful features from each text reduces the size of the feature vector and improves classification performance. Many algorithms have been applied to automatic text classification. Although the methods proposed for other languages are applicable and comparable, classification and feature selection for Persian text have not been studied sufficiently. This research is conducted on Persian text, and the introduction of a Persian dataset is part of its contribution. The article presents an approach for improving the performance of Persian text classification. The authors extracted 85,000 Persian messages from the Idekav-system, a Telegram search engine. The approach proposed for processing and classifying this text expands the feature vector by adding features chosen with the most widely used feature selection methods based on local and global filters. The expanded feature vector is then filtered by a secondary feature selection phase, which selects the more appropriate features among those added in the first step in order to enhance the effect of applying wrapper methods on classification performance. In the third step, combined filter-based methods and the combined results of different learning algorithms are used to achieve higher accuracy. After the three selection stages, the proposed method increased accuracy up to 0.945 and reduced training time and computation on the Persian dataset.
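The general idea of filter-based feature selection followed by a combination of classifier outputs can be illustrated with a minimal sketch. This is not the authors' pipeline: the abstract does not name the specific filters, the secondary selection rule, or the way classifier results are combined. The sketch assumes a chi-square score as a stand-in for one global filter, a simple top-k cut as the selection step, and majority voting as the combination rule. The toy English corpus and the function names (`chi2_scores`, `select_top_k`, `majority_vote`) are hypothetical and exist only for illustration.

```python
from collections import Counter

# Hypothetical toy corpus standing in for labelled messages:
# label 1 = the message contains a request, label 0 = it does not.
DOCS = [
    ("please send me the price list", 1),
    ("can you send the address please", 1),
    ("please tell me the opening hours", 1),
    ("the weather is nice today", 0),
    ("we watched the match last night", 0),
    ("the new phone is out today", 0),
]


def tokenize(text):
    """Whitespace tokenizer; a real Persian pipeline would need proper normalization."""
    return text.lower().split()


def chi2_scores(docs):
    """Score every term with the chi-square statistic against the binary label."""
    n = len(docs)
    n_pos = sum(label for _, label in docs)
    df = Counter()      # number of documents containing the term
    df_pos = Counter()  # number of positive documents containing the term
    for text, label in docs:
        for term in set(tokenize(text)):
            df[term] += 1
            df_pos[term] += label
    scores = {}
    for term in df:
        a = df_pos[term]       # term present, class 1
        b = df[term] - a       # term present, class 0
        c = n_pos - a          # term absent, class 1
        d = n - n_pos - b      # term absent, class 0
        denom = (a + b) * (c + d) * (a + c) * (b + d)
        # A term that occurs in every document carries no information: score 0.
        scores[term] = n * (a * d - b * c) ** 2 / denom if denom else 0.0
    return scores


def select_top_k(scores, k):
    """Keep the k highest-scoring terms; ties are broken alphabetically."""
    return sorted(scores, key=lambda t: (-scores[t], t))[:k]


def majority_vote(predictions):
    """Combine the labels predicted by several classifiers by simple majority."""
    return Counter(predictions).most_common(1)[0][0]


scores = chi2_scores(DOCS)
selected = select_top_k(scores, 3)
print("selected features:", selected)
print("ensemble label:", majority_vote([1, 0, 1]))
```

In this sketch the selected terms would become the columns of a reduced feature vector, and `majority_vote` would receive one predicted label per trained classifier for a given message. A term such as "the", which appears in every toy document, receives a score of zero and is filtered out, which is the behavior a filter method is meant to provide.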