The Impact of Feature Selection on the Information Held in Bioinformatics Data

Abstract

This research examines a wide range of attribute selection methods: 86 in total, covering both ranking and subset evaluation approaches. The methods are evaluated on bioinformatics data sets provided by the Latvian Biomedical Research and Study Centre. The data sets are intended for diagnostic tasks and contain values of more than 1000 proteomics features, together with a diagnosis (a specific cancer or healthy) established by a gold standard method (biopsy and histological analysis). The diagnostic task is solved with the classification algorithms FURIA, RIPPER, C4.5, CART, KNN, SVM, FB+ and GARF, both on the initial data set and on various sets of reduced dimensionality. The paper concludes with findings on the most effective attribute subset selection methods for the classification task in diagnostic proteomics data.
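
The workflow the abstract describes, reducing dimensionality by feature ranking and then retraining classifiers on the reduced set, can be sketched as follows. This is a minimal illustration, not the authors' WEKA-based experiment: the data are synthetic, the mutual-information ranker stands in for the 86 selection methods studied, and scikit-learn's tree, KNN, and SVM implementations stand in for the algorithms named above. All parameter choices are assumptions.

```python
# Minimal sketch: compare classifier accuracy on the full feature set
# versus a ranking-reduced subset. Synthetic data and the specific
# ranker/classifiers are illustrative stand-ins, not the paper's setup.
from sklearn.datasets import make_classification
from sklearn.feature_selection import SelectKBest, mutual_info_classif
from sklearn.model_selection import cross_val_score
from sklearn.neighbors import KNeighborsClassifier
from sklearn.pipeline import make_pipeline
from sklearn.svm import SVC
from sklearn.tree import DecisionTreeClassifier

# Synthetic stand-in for the proteomics data: >1000 features and a
# binary diagnosis label (cancer vs. healthy).
X, y = make_classification(n_samples=200, n_features=1100,
                           n_informative=30, random_state=0)

classifiers = {
    "decision tree (C4.5/CART analogue)": DecisionTreeClassifier(random_state=0),
    "KNN": KNeighborsClassifier(n_neighbors=5),
    "SVM": SVC(kernel="linear"),
}

for name, clf in classifiers.items():
    # Accuracy on the full, high-dimensional feature set.
    acc_full = cross_val_score(clf, X, y, cv=5).mean()
    # Accuracy with ranking-based selection (top 50 features by mutual
    # information), performed inside each CV fold to avoid leakage.
    pipe = make_pipeline(SelectKBest(mutual_info_classif, k=50), clf)
    acc_sel = cross_val_score(pipe, X, y, cv=5).mean()
    print(f"{name}: all 1100 features {acc_full:.3f}, top 50 {acc_sel:.3f}")
```

Placing the selector inside the cross-validation pipeline keeps the ranking from ever seeing the test folds, which matters when features vastly outnumber samples, as in proteomics data.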
