Resolving Class Imbalance in Medical Classification: Technique Comparison and Performance Evaluation
Subjects: Machine learning
Abdallah Maiti 1, Mohamed Hanini 2, Abdallah Abarda 3
1 - Laboratory of Computing, Networks, Mobility and Modelling (IR2M) FST, Hassan First University of Settat, Morocco
2 - Laboratory of Computing, Networks, Mobility and Modelling (IR2M) FST, Hassan First University of Settat, Morocco
3 - Laboratory LM2CE, Faculty of Economic Sciences and Management, Hassan First University of Settat, Morocco
Keywords: Data Imbalance, Techniques for Resolving Data Class Imbalance, Oversampling, Cost-Sensitive Learning, Convolutional Neural Networks, Classification, Model Performance, Medical Diagnostics
Abstract:
Imbalanced data is a common problem in medical diagnostics. It can reduce the accuracy of classification models and undermine the validity of their results. This paper compares several techniques for correcting class imbalance in medical datasets and evaluates their impact on machine learning performance.
We trained a convolutional neural network (CNN) on an imbalanced dataset, then tested correction techniques such as sampling and cost-sensitive learning. Finally, we evaluated the model's performance using recall, precision, accuracy, and F1 score.
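To illustrate why accuracy alone is not enough on imbalanced data, the four metrics named above can be computed per class as in the minimal sketch below. It is a pure-Python illustration, not the evaluation code used in the paper; the function name `per_class_metrics` and the toy labels are assumptions made for the example.

```python
from collections import Counter


def per_class_metrics(y_true, y_pred):
    """Return overall accuracy and per-class precision, recall and F1."""
    labels = sorted(set(y_true) | set(y_pred))
    tp, fp, fn = Counter(), Counter(), Counter()
    for t, p in zip(y_true, y_pred):
        if t == p:
            tp[t] += 1      # correct prediction for class t
        else:
            fp[p] += 1      # class p was predicted wrongly
            fn[t] += 1      # class t was missed
    report = {}
    for c in labels:
        precision = tp[c] / (tp[c] + fp[c]) if tp[c] + fp[c] else 0.0
        recall = tp[c] / (tp[c] + fn[c]) if tp[c] + fn[c] else 0.0
        f1 = (2 * precision * recall / (precision + recall)
              if precision + recall else 0.0)
        report[c] = {"precision": precision, "recall": recall, "f1": f1}
    accuracy = sum(tp.values()) / len(y_true)
    return accuracy, report


# Hypothetical toy data: 8 majority-class samples (0), 2 minority-class samples (1).
y_true = [0] * 8 + [1] * 2
# The model misses one of the two minority cases.
y_pred = [0] * 8 + [0, 1]

accuracy, report = per_class_metrics(y_true, y_pred)
print(f"accuracy = {accuracy:.2f}")
for label, scores in report.items():
    print(label, {name: round(value, 3) for name, value in scores.items()})
```

In this toy case the accuracy is 0.90 even though half of the minority cases are missed (recall of 0.5 for class 1), which is why per-class recall and F1 are reported alongside accuracy.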
The results show that the correction techniques significantly improved the performance of the classification model. Cost-sensitive learning gave the best results, particularly for the detection of minority classes. This method increases the weight of classification errors associated with minority classes, thereby improving the detection of critical cases. These results underline the importance of addressing data imbalance to improve the performance of classification models in the medical field. Methods such as cost-sensitive learning not only improve model performance but also support more reliable decisions, which is essential for more accurate diagnoses and better quality of care.
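The idea of weighting minority-class errors more heavily can be sketched as a class-weighted cross-entropy loss. The abstract does not state which weighting scheme was used, so the inverse-frequency ("balanced") heuristic below is an assumption, as are the function names `balanced_class_weights` and `weighted_cross_entropy` and the toy data.

```python
import math
from collections import Counter


def balanced_class_weights(labels):
    """Inverse-frequency weights: n_samples / (n_classes * class_count).

    Assumed heuristic for illustration; the paper's scheme is not specified.
    """
    counts = Counter(labels)
    n_samples = len(labels)
    n_classes = len(counts)
    return {c: n_samples / (n_classes * k) for c, k in counts.items()}


def weighted_cross_entropy(probs, labels, weights):
    """Mean over samples of -w[y] * log p(y), where y is the true class."""
    total = 0.0
    for p, y in zip(probs, labels):
        total += -weights[y] * math.log(p[y])
    return total / len(labels)


# Hypothetical toy data: 8 majority-class samples (0), 2 minority-class samples (1).
labels = [0] * 8 + [1] * 2
weights = balanced_class_weights(labels)
print(weights)  # {0: 0.625, 1: 2.5}

# The same uncertain prediction (probability 0.5 for the true class) is
# penalised four times more when the true class is the minority class.
loss_majority = weighted_cross_entropy([[0.5, 0.5]], [0], weights)
loss_minority = weighted_cross_entropy([[0.5, 0.5]], [1], weights)
print(loss_minority / loss_majority)
```

With an 8:2 split, an error on a minority sample costs four times as much as the same error on a majority sample, which pushes the model toward detecting the rarer, critical cases. Deep learning frameworks typically expose equivalent per-class weights in their loss functions or training routines.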
