Graph Based Feature Selection Using Symmetrical Uncertainty in Microarray Dataset
Subject Areas: Data Mining
Soodeh Bakhshandeh 1, Reza Azmi 2, Mohammad Teshnehlab 3
1 - Science and Research Branch, Islamic Azad University
2 -
3 - K. N. Toosi University of Technology
Keywords: Gene selection, Microarray data, Filter method, Graph-based clustering, Feature selection
Abstract:
Microarray data, with few samples and thousands of genes, poses a difficult challenge for researchers. Gene selection helps to pick the most relevant genes from the original dataset, reducing the dimensionality of the microarray data while increasing prediction performance. In this paper, a new gene selection method is proposed, based on a community detection technique and a ranking of the best genes. In the first step, symmetric uncertainty is used to select the best genes: the similarity between each pair of genes, and between each gene and the class label, is computed, which represents the search space as a graph. The graph is then divided into several clusters by a community detection algorithm; finally, the genes are ranked and those with the highest ranks are selected as the best genes. The approach is a supervised, filter-based gene selection method that minimizes the redundancy between genes and maximizes the relevance between the genes and the class label. Its performance is compared with thirteen well-known supervised and unsupervised gene selection approaches over six microarray datasets, using four classifiers: SVM, DT, NB, and k-NN. The results show the advantages of the proposed approach.
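To make the described pipeline concrete, the following is a minimal Python sketch of the idea, not the authors' implementation. Symmetric uncertainty, SU(X, Y) = 2 I(X; Y) / (H(X) + H(Y)), scores both gene-gene similarity (redundancy) and gene-class similarity (relevance); genes become graph nodes, sufficiently similar pairs become edges, the graph is partitioned into communities of mutually redundant genes, and the most class-relevant genes of each community are kept. The edge threshold, the use of NetworkX greedy modularity in place of the paper's community detection and ranking steps, and all names below are illustrative assumptions; expression values are assumed to be discretized (e.g., with the Fayyad-Irani method) before SU is computed.

import numpy as np
import networkx as nx
from sklearn.metrics import mutual_info_score  # I(X; Y) for discrete variables


def entropy(x):
    # Shannon entropy (in nats) of a discrete variable.
    _, counts = np.unique(x, return_counts=True)
    p = counts / counts.sum()
    return float(-np.sum(p * np.log(p)))


def symmetric_uncertainty(x, y):
    # SU(X, Y) = 2 * I(X; Y) / (H(X) + H(Y)), normalized to [0, 1].
    denom = entropy(x) + entropy(y)
    return 2.0 * mutual_info_score(x, y) / denom if denom > 0 else 0.0


def select_genes(X, y, edge_threshold=0.3, top_per_community=2):
    # X: (n_samples, n_genes) discretized expression matrix; y: class labels.
    n_genes = X.shape[1]

    # Relevance: SU between each gene and the class label.
    relevance = np.array([symmetric_uncertainty(X[:, i], y) for i in range(n_genes)])

    # Redundancy graph: an edge joins each pair of sufficiently similar genes.
    G = nx.Graph()
    G.add_nodes_from(range(n_genes))
    for i in range(n_genes):
        for j in range(i + 1, n_genes):
            su = symmetric_uncertainty(X[:, i], X[:, j])
            if su > edge_threshold:
                G.add_edge(i, j, weight=su)

    # Partition the graph into communities of mutually redundant genes.
    communities = nx.algorithms.community.greedy_modularity_communities(G, weight="weight")

    # Keep the most class-relevant genes of each community, which bounds
    # redundancy (similar genes share a community) while keeping relevance high.
    selected = []
    for community in communities:
        ranked = sorted(community, key=lambda g: relevance[g], reverse=True)
        selected.extend(ranked[:top_per_community])
    return sorted(selected)

A classifier (SVM, DT, NB, or k-NN, as in the experiments) would then be trained on X[:, select_genes(X, y)]. Since the all-pairs SU computation is quadratic in the number of genes, a practical run on microarray data would usually pre-filter to the few hundred most class-relevant genes before building the graph.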