An Approach Based on Approximate Binary Tree Construction for Speeding Up Nearest-Neighbor Search in Big Data
Affiliations:
1 - Shahid Rajaee Teacher Training University, Faculty of Computer Engineering
2 - Shahid Rajaee Teacher Training University, Faculty of Computer Engineering
Keywords: overlap buffer, big data, binary decision tree, nearest-neighbor classification
Abstract:
Given the ever-increasing rate at which data is generated and the need to turn that data into knowledge, traditional machine learning methods no longer keep up. When classification is performed with such methods, and in particular with inherently lazy classifiers such as k-nearest neighbors (KNN), classifying big data is very slow. Nearest-neighbor classification nevertheless remains popular because of its simplicity and the practical accuracy it delivers. The proposed method organizes the feature vectors of the training data into a binary search tree in order to accelerate big-data classification with the nearest-neighbor method. It does so by approximately finding the two locally furthest data points at each tree node; these two points serve as pivots for splitting the data held by the current node into two groups. Each data point in the node is assigned to the left or right child of the current node according to its similarity to these two pivots. Results of extensive experiments on various datasets from the UCI repository show that the proposed method achieves good accuracy relative to its low running time.
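The construction described above can be sketched in plain Python. This is a minimal illustration, not the authors' implementation: the function and class names are hypothetical, the furthest pair is approximated with a simple greedy "jump to the furthest point twice" heuristic, and `leaf_size` and `k` are assumed tuning parameters.

```python
import math
import random
from collections import Counter


def approx_furthest_pair(points):
    """Greedy heuristic: start anywhere, jump to the furthest point twice."""
    a = random.choice(points)
    b = max(points, key=lambda p: math.dist(a, p))
    a = max(points, key=lambda p: math.dist(b, p))
    return a, b


class Node:
    """Binary search tree node; data is a list of (feature_vector, label) pairs."""

    def __init__(self, data, leaf_size=16):
        self.data = data
        self.left = self.right = None
        if len(data) > leaf_size:
            pts = [x for x, _ in data]
            # The two approximately furthest points act as split pivots.
            self.p1, self.p2 = approx_furthest_pair(pts)
            left = [d for d in data if math.dist(d[0], self.p1) <= math.dist(d[0], self.p2)]
            right = [d for d in data if math.dist(d[0], self.p1) > math.dist(d[0], self.p2)]
            if left and right:  # avoid a degenerate split into an empty child
                self.left = Node(left, leaf_size)
                self.right = Node(right, leaf_size)


def classify(tree, q, k=3):
    """Descend to one leaf by pivot similarity, then vote among its k nearest points."""
    node = tree
    while node.left is not None:
        node = node.left if math.dist(q, node.p1) <= math.dist(q, node.p2) else node.right
    nearest = sorted(node.data, key=lambda d: math.dist(q, d[0]))[:k]
    return Counter(label for _, label in nearest).most_common(1)[0][0]
```

Because a query follows a single root-to-leaf path and only the leaf is searched exhaustively, classification cost drops from linear in the training-set size to roughly logarithmic, at the price of the approximation inherent in the greedy split.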