SafePub: A Truthful Data Anonymization Algorithm With Strong Privacy Guarantees

Open access

Abstract

Methods for privacy-preserving data publishing and analysis trade off privacy risks for individuals against the quality of output data. In this article, we present a data publishing algorithm that satisfies the differential privacy model. The transformations performed are truthful, which means that the algorithm does not perturb input data or generate synthetic output data. Instead, records are randomly drawn from the input dataset and the uniqueness of their features is reduced. This also offers an intuitive notion of privacy protection. Moreover, the approach is generic, as it can be parameterized with different objective functions to optimize its output towards different applications. We show this by integrating six well-known data quality models. We present an extensive analytical and experimental evaluation and a comparison with prior work. The results show that our algorithm is the first practical implementation of the described approach and that it can be used with reasonable privacy parameters resulting in high degrees of protection. Moreover, when parameterizing the generic method with an objective function quantifying the suitability of data for building statistical classifiers, we measured prediction accuracies that compare very well with results obtained using state-of-the-art differentially private classification algorithms.
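The mechanism described above, randomly drawing records from the input dataset and then reducing the uniqueness of their features, can be illustrated with a minimal sketch. This is not the paper's implementation: the function name `safepub_sketch`, the `generalize` callback, and the parameters `beta` (sampling probability) and `k` (minimum group size) are illustrative assumptions chosen to mirror the sample-then-generalize-then-suppress pattern the abstract outlines.

```python
import random
from collections import Counter

def safepub_sketch(records, generalize, beta=0.5, k=5):
    """Hypothetical sketch of a truthful sample/generalize/suppress pipeline.

    records:    list of quasi-identifier values (tuples in a real setting)
    generalize: caller-supplied function that coarsens a record's features
    beta:       probability with which each record enters the random sample
    k:          minimum number of occurrences a generalized record must have
    """
    # Step 1: random sampling -- each input record is included
    # independently with probability beta. No record is perturbed.
    sample = [r for r in records if random.random() < beta]

    # Step 2: generalization reduces the uniqueness of feature values,
    # e.g. replacing an exact age such as 34 with the decade 30.
    generalized = [generalize(r) for r in sample]

    # Step 3: suppression -- drop records whose generalized value
    # occurs fewer than k times, so every released record is
    # indistinguishable from at least k-1 others.
    counts = Counter(generalized)
    return [r for r in generalized if counts[r] >= k]

# Toy usage: ages generalized to decades, with beta=1.0 for determinism.
ages = [31, 32, 33, 34, 35, 71]
released = safepub_sketch(ages, lambda a: a // 10 * 10, beta=1.0, k=3)
```

Because every released value is a coarsened copy of a real input record, the output is truthful in the sense used above; no synthetic or noise-perturbed values are produced.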

