SafePub: A Truthful Data Anonymization Algorithm With Strong Privacy Guarantees

Open access

Abstract

Methods for privacy-preserving data publishing and analysis trade off privacy risks for individuals against the quality of output data. In this article, we present a data publishing algorithm that satisfies the differential privacy model. The transformations performed are truthful, which means that the algorithm does not perturb input data or generate synthetic output data. Instead, records are randomly drawn from the input dataset and the uniqueness of their features is reduced. This also offers an intuitive notion of privacy protection. Moreover, the approach is generic, as it can be parameterized with different objective functions to optimize its output towards different applications. We show this by integrating six well-known data quality models. We present an extensive analytical and experimental evaluation and a comparison with prior work. The results show that our algorithm is the first practical implementation of the described approach and that it can be used with reasonable privacy parameters resulting in high degrees of protection. Moreover, when parameterizing the generic method with an objective function quantifying the suitability of data for building statistical classifiers, we measured prediction accuracies that compare very well with results obtained using state-of-the-art differentially private classification algorithms.
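The abstract describes a truthful scheme: records are randomly sampled from the input, the uniqueness of their features is reduced, and no values are perturbed or synthesized. As a rough illustration of that general idea (not the paper's actual algorithm), the following sketch samples records, coarsens a quasi-identifier with a user-supplied generalization function, and suppresses records whose generalized form remains too unique. The sampling probability `beta`, the group-size threshold `k`, and the age-banding function are illustrative assumptions, not parameters taken from the paper.

```python
import random

def sample_generalize_suppress(records, generalize, beta=0.5, k=5, seed=42):
    """Illustrative sketch of a truthful sample-then-generalize scheme.

    Each record is kept independently with probability beta (random
    sampling), coarsened with a user-supplied `generalize` function, and
    suppressed if its generalized form occurs fewer than k times. No
    values are perturbed, so the output is a truthful subset of the
    (generalized) input.
    """
    rng = random.Random(seed)
    # Step 1: randomly sample input records.
    sample = [r for r in records if rng.random() < beta]
    # Step 2: reduce uniqueness by generalizing quasi-identifiers.
    generalized = [generalize(r) for r in sample]
    # Step 3: suppress records that are still too unique (group size < k).
    counts = {}
    for g in generalized:
        counts[g] = counts.get(g, 0) + 1
    return [g for g in generalized if counts[g] >= k]

# Hypothetical example: generalize exact ages into 10-year bands.
records = [("age", a) for a in [23, 27, 24, 61, 25, 22, 26, 28]]
band = lambda r: (r[0], f"{(r[1] // 10) * 10}-{(r[1] // 10) * 10 + 9}")
out = sample_generalize_suppress(records, band, beta=1.0, k=3)
```

With `beta=1.0` all eight records are sampled; the seven ages in the twenties generalize to the band "20-29" and survive, while the single record in the "60-69" band falls below `k=3` and is suppressed. The actual privacy guarantees depend on a careful choice of sampling and suppression parameters, which is the subject of the article itself.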


