Automatic Error Detecting in Databases, Based on Clustering and Nearest Neighbor
Subject Areas: electrical and computer engineering
M. Ataeyan 1, N. Daneshpour 2
1 -
2 -
Keywords: Data cleaning, automatic error detection, clustering, k-means
Abstract:
Data quality affects companies' decision making: decisions based on poor-quality data incur high costs. Data quality has several dimensions, of which accuracy is the most important. Data cleaning requires error detection, and because of the huge volume of data, an automatic system is needed to perform this process without user interaction. This paper proposes an error detection approach based on k-means clustering. First, the data are clustered for each attribute. Then, for each data item in each cluster, a method similar to k-nearest neighbor is used to detect errors. The proposed method can detect multiple errors in one record, and it can detect errors in fields of various attribute types. Experimental results show that the approach detects 91% of the errors in the data on average. The approach is also compared with an automatic rule-based method that detects errors in various attribute types; the experimental results show that the proposed approach performs 25% better on average at detecting errors.
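The abstract only outlines the two stages (per-attribute clustering, then a nearest-neighbor style check inside each cluster), so the following is a minimal sketch under stated assumptions, not the authors' implementation. It handles a single numeric attribute only. The function names (`kmeans_1d`, `knn_scores`, `detect_errors`), the scoring rule (mean distance to the nearest neighbors in the same cluster), and the cutoff (a multiple of the median score) are all hypothetical choices made for illustration.

```python
from statistics import median


def kmeans_1d(values, k, iterations=50):
    """Lloyd's k-means on one numeric attribute; returns one cluster label per value."""
    ordered = sorted(values)
    # Assumption: deterministic start, centroids spread evenly over the sorted values.
    centroids = [ordered[(2 * i + 1) * len(ordered) // (2 * k)] for i in range(k)]
    labels = [0] * len(values)
    for _ in range(iterations):
        # Assign every value to its closest centroid.
        labels = [min(range(k), key=lambda c: abs(v - centroids[c])) for v in values]
        # Recompute each centroid as the mean of its members.
        updated = []
        for c in range(k):
            members = [v for v, lab in zip(values, labels) if lab == c]
            updated.append(sum(members) / len(members) if members else centroids[c])
        if updated == centroids:
            break
        centroids = updated
    return labels


def knn_scores(values, labels, n_neighbors):
    """Mean distance from each value to its nearest neighbors within its own cluster."""
    scores = []
    for i, v in enumerate(values):
        dists = sorted(
            abs(v - w)
            for j, w in enumerate(values)
            if j != i and labels[j] == labels[i]
        )
        nearest = dists[:n_neighbors]
        # A value alone in its cluster has no neighbors; treat it as maximally isolated.
        scores.append(sum(nearest) / len(nearest) if nearest else float("inf"))
    return scores


def detect_errors(values, k=2, n_neighbors=3, factor=5.0):
    """Return the indices of values whose neighbor score is far above the typical score."""
    labels = kmeans_1d(values, k)
    scores = knn_scores(values, labels, n_neighbors)
    finite = [s for s in scores if s != float("inf")]
    # Assumption: a score above `factor` times the median score marks a suspected error.
    cutoff = factor * median(finite) if finite else 0.0
    return [i for i, s in enumerate(scores) if s > cutoff]


# Hypothetical "age" column with one implausible entry (250).
ages = [21, 62, 23, 250, 25, 61, 22, 63, 24, 65, 64]
print(detect_errors(ages))  # → [3]
```

Running the same check once per attribute would let several fields of one record be flagged independently, which matches the abstract's claim about multiple errors in one record. Non-numeric attributes would need a different distance measure, which this sketch does not attempt.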