An Approach Based on Approximate Binary Tree Construction for Speeding Up Nearest-Neighbor Search in Big Data
Affiliations:
1 - Shahid Rajaee Teacher Training University, Faculty of Computer Engineering
2 - Shahid Rajaee Teacher Training University, Faculty of Computer Engineering
Keywords: overlap buffer, big data, binary decision tree, nearest-neighbor classification
Abstract:
Given the ever-increasing rate at which data is generated and the need to turn that data into knowledge, traditional machine learning methods no longer keep up. When classification is performed with such methods, and in particular with inherently lazy classifiers such as k-nearest neighbors (KNN), classifying big data is very slow. Nearest-neighbor classification nevertheless remains popular because of its simplicity and the practical accuracy it delivers. The proposed method organizes the feature vectors of the training data into a binary search tree in order to accelerate big-data classification with the nearest-neighbor method. It does so by approximately finding the two locally furthest data points at each tree node; these two points serve as pivots for splitting the data held by the current node into two groups. Each data point in the node is assigned to the left or right child of the current node according to its similarity to these two pivots. Results of extensive experiments on various datasets from the UCI repository show that the proposed method achieves good accuracy relative to its low running time.
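The construction described above can be sketched in plain Python. This is a minimal illustration, not the authors' implementation: the function and class names are hypothetical, the furthest pair is approximated with a simple greedy "jump to the furthest point twice" heuristic, and `leaf_size` and `k` are assumed tuning parameters.

```python
import math
import random
from collections import Counter


def approx_furthest_pair(points):
    """Greedy heuristic: start anywhere, jump to the furthest point twice."""
    a = random.choice(points)
    b = max(points, key=lambda p: math.dist(a, p))
    a = max(points, key=lambda p: math.dist(b, p))
    return a, b


class Node:
    """Binary search tree node; data is a list of (feature_vector, label) pairs."""

    def __init__(self, data, leaf_size=16):
        self.data = data
        self.left = self.right = None
        if len(data) > leaf_size:
            pts = [x for x, _ in data]
            # The two approximately furthest points act as split pivots.
            self.p1, self.p2 = approx_furthest_pair(pts)
            left = [d for d in data if math.dist(d[0], self.p1) <= math.dist(d[0], self.p2)]
            right = [d for d in data if math.dist(d[0], self.p1) > math.dist(d[0], self.p2)]
            if left and right:  # avoid a degenerate split into an empty child
                self.left = Node(left, leaf_size)
                self.right = Node(right, leaf_size)


def classify(tree, q, k=3):
    """Descend to one leaf by pivot similarity, then vote among its k nearest points."""
    node = tree
    while node.left is not None:
        node = node.left if math.dist(q, node.p1) <= math.dist(q, node.p2) else node.right
    nearest = sorted(node.data, key=lambda d: math.dist(q, d[0]))[:k]
    return Counter(label for _, label in nearest).most_common(1)[0][0]
```

Because a query follows a single root-to-leaf path and only the leaf is searched exhaustively, classification cost drops from linear in the training-set size to roughly logarithmic, at the price of the approximation inherent in the greedy split.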