Data mining methods for gene selection on the basis of gene expression arrays

Open access


The paper presents data mining methods applied to gene selection for the recognition of a particular type of prostate cancer on the basis of gene expression arrays. Several gene selection methods, including the Fisher method, correlation of a gene with a class, application of the support vector machine, and statistical hypothesis testing, are compared on the basis of clustering measures. The results of these individual selection methods are combined to identify the most frequently selected genes, which form the required pattern best associated with the cancerous cases. The resulting set of selected genes is used as the input to the classifier that performs the final recognition. The numerical results of distinguishing prostate cancer from normal (reference) cases using the selected genes and the support vector machine confirm the good performance of the proposed gene selection approach.
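The idea of combining several selection criteria and keeping the most often selected genes can be illustrated with a minimal sketch. It covers only two of the criteria named in the abstract (a Fisher-type score and the correlation of a gene with the class) and is not the authors' implementation. The score formula |m1 - m0| / (s1 + s0), the function names, the parameters `k` and `min_votes`, and the toy data are all assumptions made for illustration.

```python
from collections import Counter
from statistics import mean, pstdev


def fisher_score(values, labels):
    """Assumed Fisher-type score |m1 - m0| / (s1 + s0) for a single gene."""
    pos = [v for v, y in zip(values, labels) if y == 1]
    neg = [v for v, y in zip(values, labels) if y == 0]
    spread = pstdev(pos) + pstdev(neg)
    return abs(mean(pos) - mean(neg)) / spread if spread > 0 else 0.0


def class_correlation(values, labels):
    """Absolute Pearson correlation between a gene's expression and the class label."""
    mv, ml = mean(values), mean(labels)
    cov = sum((v - mv) * (y - ml) for v, y in zip(values, labels))
    var_v = sum((v - mv) ** 2 for v in values)
    var_l = sum((y - ml) ** 2 for y in labels)
    denom = (var_v * var_l) ** 0.5
    return abs(cov) / denom if denom > 0 else 0.0


def top_k(scores, k):
    """Indices of the k highest-scoring genes."""
    order = sorted(range(len(scores)), key=lambda j: scores[j], reverse=True)
    return set(order[:k])


def select_genes(expression, labels, k=2, min_votes=2):
    """Keep genes placed in the top k by at least min_votes criteria.

    expression: list of samples, each a list of per-gene expression values.
    labels: 1 for cancer, 0 for normal (reference) samples.
    """
    n_genes = len(expression[0])
    columns = [[row[j] for row in expression] for j in range(n_genes)]
    votes = Counter()
    for criterion in (fisher_score, class_correlation):
        scores = [criterion(col, labels) for col in columns]
        votes.update(top_k(scores, k))
    return sorted(j for j, n in votes.items() if n >= min_votes)


# Hypothetical toy data: 6 samples x 4 genes; genes 0 and 2 separate the classes.
expression = [
    [5.0, 1.0, 0.2, 3.0],
    [5.2, 2.0, 0.1, 3.1],
    [4.9, 1.5, 0.3, 2.9],
    [1.0, 1.2, 2.0, 3.0],
    [1.1, 1.9, 2.1, 3.2],
    [0.9, 1.4, 1.9, 2.8],
]
labels = [1, 1, 1, 0, 0, 0]

selected = select_genes(expression, labels)
print(selected)  # [0, 2]
```

In a fuller pipeline the expression columns of the selected genes would then be passed to a classifier such as a support vector machine; that step is omitted here.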

