De-anonymizing Genomic Databases Using Phenotypic Traits

Open access


People increasingly have their genomes sequenced, and some of them share their genomic data online. They do so for various purposes, including finding relatives and helping to advance genomic research. An individual’s genome carries very sensitive, private information, such as its owner’s susceptibility to diseases, which could be used for discrimination. Therefore, genomic databases are often anonymized. However, an individual’s genotype is also linked to visible phenotypic traits, such as eye or hair color, which can be used to re-identify users in anonymized public genomic databases, thus raising severe privacy issues. For instance, an adversary can identify a target’s genome using her known phenotypic traits and subsequently infer her susceptibility to Alzheimer’s disease. In this paper, we quantify, based on various phenotypic traits, the extent of this threat in several scenarios by implementing de-anonymization attacks on a genomic database of OpenSNP users sequenced by 23andMe. Our experimental results show that the proportion of correct matches reaches 23% with a supervised approach in a database of 50 participants. In most scenarios, our approach outperforms the baseline by a factor of four in terms of the proportion of correct matches. We also evaluate the adversary’s ability to predict individuals’ predisposition to Alzheimer’s disease, and we observe that the inference error can be halved compared to the baseline. Finally, we analyze how the number of known phenotypic traits affects the success rate of the attack. As progress is made in genomic research, especially on genotype-phenotype associations, the threat presented in this paper will become more serious.
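To make the notion of "proportion of correct matches" concrete, the following is a minimal sketch of one plausible matching attack, not necessarily the procedure used in the paper. It assumes the adversary holds a table of P(phenotype | genotype) for a single trait and assigns each anonymized genotype to exactly one known phenotype record so that the total log-likelihood is maximal. The genotype labels, the probabilities, and the function names (`best_matching`, `proportion_correct`) are all hypothetical and chosen for illustration only.

```python
from itertools import permutations
from math import log

# Hypothetical P(eye color | genotype at one SNP); the numbers are illustrative,
# not taken from the paper or from any genotype-phenotype study.
LIKELIHOOD = {
    "GG": {"blue": 0.80, "green": 0.15, "brown": 0.05},
    "AG": {"blue": 0.20, "green": 0.50, "brown": 0.30},
    "AA": {"blue": 0.05, "green": 0.15, "brown": 0.80},
}


def log_likelihood(genotype, phenotype):
    """Log-probability of observing `phenotype` given `genotype`."""
    return log(LIKELIHOOD[genotype][phenotype])


def best_matching(genotypes, phenotypes):
    """Return a tuple m where m[i] is the index of the phenotype record
    assigned to genotype i, maximizing the total log-likelihood.

    Brute force over all one-to-one assignments; only suitable for tiny inputs.
    """
    n = len(genotypes)
    return max(
        permutations(range(n)),
        key=lambda perm: sum(
            log_likelihood(genotypes[i], phenotypes[perm[i]]) for i in range(n)
        ),
    )


def proportion_correct(matching, truth):
    """Fraction of genotypes assigned to their true phenotype record."""
    return sum(m == t for m, t in zip(matching, truth)) / len(truth)


# Toy database: three anonymized genotypes and three identified phenotype records.
genotypes = ["GG", "AA", "AG"]
phenotypes = ["brown", "blue", "green"]
truth = (1, 0, 2)  # assumed ground truth: genotype i belongs to record truth[i]

matching = best_matching(genotypes, phenotypes)
print(matching, proportion_correct(matching, truth))
```

Brute force enumerates every one-to-one assignment, so it only works for a handful of individuals. For a database of 50 participants, as in the abstract, a polynomial-time assignment solver such as SciPy's `linear_sum_assignment` would be the natural replacement, and several traits could be combined by summing their log-likelihoods under an independence assumption.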


