Subpopulation Discovery in Epidemiological Data with Subspace Clustering

Open access


A prerequisite of personalized medicine is the identification of groups of people who share specific risk factors for an outcome. We investigate the potential of subspace clustering for finding such groups in epidemiological data. We propose a workflow that encompasses clusterability assessment before cluster discovery and quality assessment after the clusters have been learned. Epidemiological data usually lack a ground truth against which clusters found in subspaces could be verified. Hence, we introduce quality assessment through juxtaposition of the learned models with “models-of-randomness”, i.e., models that do not reflect a true cluster structure. On the basis of this workflow, we select subspace clustering methods and compare and discuss their performance. We use a dataset with hepatic steatosis as the outcome, but our findings apply to arbitrary epidemiological cohort data that have tens of variables and exhibit class skew.



