Gene selection ensembles and classifier ensembles for medical diagnosis

Open access


The usefulness of combining methods is examined using microarray cancer data sets, in which expression levels of very large numbers of genes are reported. Two-group discrimination problems are examined on three such data sets. For each data set, the cross-validation errors evaluated on the remaining half of the data, which was not used earlier for the selection of genes, served as measures of classifier performance. Two common single procedures for the selection of genes, Prediction Analysis of Microarrays (PAM) and Significance Analysis of Microarrays (SAM), were compared with the fusion of eight selection procedures, or of a smaller subset of five of them, excluding SAM or PAM. Merging five or eight selection methods gave similar results. Judged by the misclassification rates, and for every examined ensemble of classifiers, combining gene selection methods was not superior to single PAM or SAM selection for two of the three examined data sets. Additionally, heterogeneous combining of five base classifiers significantly outperformed resampling classifiers such as bagged decision trees; the base classifiers were k-nearest neighbors, linear SVM and radial SVM with parameter c=1, shrunken centroids regularized discriminant analysis (SCRDA), and the nearest mean classifier. Heterogeneously combined classifiers also outperformed double bagging for some ranges of gene numbers and some data sets, but were generally not superior to random forests. The preliminary step of combining gene rankings was generally not essential to the performance of either heterogeneously or homogeneously combined classifiers.
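The two kinds of combining discussed above can be illustrated with a minimal sketch. The abstract does not state how the gene rankings were fused or how the base classifiers' outputs were merged, so both rules used here are assumptions chosen only for illustration: rankings are fused by mean rank position, and class labels are combined by simple majority vote. The function names `fuse_rankings` and `majority_vote`, the gene identifiers, and the toy predictions are hypothetical.

```python
from collections import Counter
from typing import Dict, List, Sequence


def fuse_rankings(rankings: Sequence[Sequence[str]], top_k: int) -> List[str]:
    """Merge several gene rankings (best gene first) by mean rank position.

    Assumption: mean-rank fusion; the source does not specify the fusion rule.
    """
    genes = set(rankings[0])
    if any(set(r) != genes for r in rankings):
        raise ValueError("all rankings must order the same set of genes")
    mean_rank: Dict[str, float] = {g: 0.0 for g in genes}
    for ranking in rankings:
        for position, gene in enumerate(ranking):
            mean_rank[gene] += position / len(rankings)
    # Sort by mean rank; ties are broken alphabetically for reproducibility.
    return sorted(genes, key=lambda g: (mean_rank[g], g))[:top_k]


def majority_vote(predictions: Sequence[Sequence[int]]) -> List[int]:
    """Combine class labels from several base classifiers, sample by sample.

    Assumption: unweighted majority vote; the source does not specify the rule.
    """
    combined = []
    for votes in zip(*predictions):
        counts = Counter(votes)
        best = max(counts.values())
        # Ties go to the smallest label (an arbitrary, illustrative choice).
        combined.append(min(label for label, n in counts.items() if n == best))
    return combined


# Toy data: three hypothetical selection methods ranking four genes.
rankings = [
    ["g1", "g2", "g3", "g4"],
    ["g2", "g1", "g4", "g3"],
    ["g1", "g3", "g2", "g4"],
]
selected = fuse_rankings(rankings, top_k=2)

# Toy data: labels predicted by five hypothetical base classifiers
# for four samples in a two-group problem.
predictions = [
    [0, 1, 1, 0],
    [0, 1, 0, 0],
    [1, 1, 1, 0],
    [0, 0, 1, 1],
    [0, 1, 1, 0],
]
labels = majority_vote(predictions)
print(selected, labels)
```

In a setting like the one described, the fused gene list would be computed on the half of the data reserved for selection, and the combined classifier would then be evaluated only on the other half.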


Biometrical Letters

The Journal of Polish Biometric Society
