A Semi-Supervised Ensemble Algorithm Using a Confidence-Score-Threshold-Based Selection Metric in Non-Stationary Data Streams
Subject area: Electrical and Computer Engineering
Shirin Khezri 1, Jafar Tanha 2, Ali Ahmadi 3, Arash Sharifi 4
1 - Islamic Azad University, Science and Research Branch, Tehran
2 - University of Amsterdam, Netherlands
3 - Osaka University, Japan
4 - Islamic Azad University, Science and Research Branch, Tehran
Keywords: semi-supervised classification algorithms, selection metric, ensemble classification models, concept drift, data stream mining
Abstract:
In this article, we propose a novel Semi-Supervised Ensemble classifier using a Confidence-Based Selection metric, named SSE-CBS. The proposed approach uses both labeled and unlabeled data and aims to react to different types of concept drift. SSE-CBS combines the accuracy-based weighting mechanism known from block-based ensembles with the incremental nature of the Hoeffding Tree. The proposed algorithm is experimentally compared with eight state-of-the-art stream methods, including supervised, semi-supervised, and single classifiers as well as block-based ensembles, under different drift scenarios. Out of all the compared algorithms, SSE-CBS outperforms the other semi-supervised ensemble approaches. The experimental results show that SSE-CBS is suitable for scenarios involving many types of drift with limited labeled data.
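The mechanism described in the abstract can be sketched in miniature: an ensemble of incremental base learners is reweighted by accuracy on each labeled block, and unlabeled instances whose ensemble confidence score exceeds a threshold receive pseudo-labels for further training. The sketch below is a hypothetical illustration under stated assumptions, not the authors' implementation: simple per-class counting learners stand in for Hoeffding Trees, and the class names, feature encoding, and threshold value are made up for demonstration.

```python
from collections import defaultdict

class CountingLearner:
    """Toy incremental learner: per-class feature-value counts
    (a stand-in for a Hoeffding Tree base model)."""
    def __init__(self):
        self.class_counts = defaultdict(int)
        # (feature index, feature value, class) -> count
        self.feat_counts = defaultdict(int)

    def learn_one(self, x, y):
        self.class_counts[y] += 1
        for i, v in enumerate(x):
            self.feat_counts[(i, v, y)] += 1

    def predict_proba_one(self, x):
        total = sum(self.class_counts.values()) or 1
        scores = {}
        for c, n in self.class_counts.items():
            p = n / total
            for i, v in enumerate(x):
                # Laplace smoothing; denominator assumes binary feature values
                p *= (self.feat_counts[(i, v, c)] + 1) / (n + 2)
            scores[c] = p
        s = sum(scores.values()) or 1.0
        return {c: p / s for c, p in scores.items()}

class ConfidenceThresholdEnsemble:
    """Accuracy-weighted ensemble with confidence-threshold pseudo-labeling."""
    def __init__(self, n_members=3, threshold=0.8):
        self.members = [CountingLearner() for _ in range(n_members)]
        self.weights = [1.0] * n_members
        self.threshold = threshold

    def predict_proba_one(self, x):
        agg = defaultdict(float)
        for w, m in zip(self.weights, self.members):
            for c, p in m.predict_proba_one(x).items():
                agg[c] += w * p
        s = sum(agg.values()) or 1.0
        return {c: p / s for c, p in agg.items()}

    def process_block(self, labeled, unlabeled):
        # 1) reweight each member by its accuracy on the labeled block
        for k, m in enumerate(self.members):
            correct = 0
            for x, y in labeled:
                proba = m.predict_proba_one(x)
                if proba and max(proba, key=proba.get) == y:
                    correct += 1
            self.weights[k] = (correct + 1) / (len(labeled) + 1)
        # 2) train members incrementally on the labeled instances
        for x, y in labeled:
            for m in self.members:
                m.learn_one(x, y)
        # 3) pseudo-label unlabeled instances whose ensemble
        #    confidence score exceeds the threshold
        for x in unlabeled:
            proba = self.predict_proba_one(x)
            c = max(proba, key=proba.get)
            if proba[c] >= self.threshold:
                for m in self.members:
                    m.learn_one(x, c)
```

Processing the stream block by block lets the accuracy weights adapt to concept drift, while the confidence threshold limits how much label noise pseudo-labeling can inject.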