Differentially Private SQL with Bounded User Contribution

Royce J Wilson, Celia Yuxin Zhang, William Lam, Damien Desfontaines, Daniel Simmons-Marengo and Bryant Gipson

Abstract

Differential privacy (DP) provides formal guarantees that the output of a database query does not reveal too much information about any individual present in the database. While many differentially private algorithms have been proposed in the scientific literature, there are only a few end-to-end implementations of differentially private query engines. Crucially, existing systems assume that each individual is associated with at most one database record, which is unrealistic in practice. We propose a generic and scalable method to perform differentially private aggregations on databases, even when individuals can each be associated with arbitrarily many rows. We express this method as an operator in relational algebra, and implement it in an SQL engine. To validate this system, we test the utility of typical queries on industry benchmarks, and verify its correctness with a stochastic test framework we developed. We highlight the promises and pitfalls learned when deploying such a system in practice, and we publish its core components as open-source software.

If the inline PDF is not rendering correctly, you can download the PDF file here.

  • [1] Kareem Amin, Alex Kulesza, Andres Munoz, and Sergei Vassilvtiskii. Bounding user contributions: A bias-variance trade-off in differential privacy. In Proceedings of the 36th International Conference on Machine Learning, PMLR 97, pages 263–271, 2019.

  • [2] Johes Bater, Xi He, William Ehrich, Ashwin Machanavajjhala, and Jennie Rogers. Shrinkwrap: Differentially-private query processing in private data federations. arXiv preprint arXiv:1810.01816, 2018.

  • [3] Michael Ben-Or and Avinatan Hassidim. The Bayesian learner is optimal for noisy binary search (and pretty good for quantum as well). In 2008 49th Annual IEEE Symposium on Foundations of Computer Science, pages 221–230. IEEE, 2008.

  • [4] Benjamin Bichsel, Timon Gehr, Dana Drachsler-Cohen, Petar Tsankov, and Martin Vechev. DP-finder: Finding differential privacy violations by sampling and optimization. In Proceedings of the 2018 ACM SIGSAC Conference on Computer and Communications Security, pages 508–524. ACM, 2018.

  • [5] Vincent Bindschaedler, Reza Shokri, and Carl A Gunter. Plausible deniability for privacy-preserving data synthesis. Proceedings of the VLDB Endowment, 10(5):481–492, 2017.

  • [6] Mark Bun and Thomas Steinke. Concentrated differential privacy: Simplifications, extensions, and lower bounds. In Theory of Cryptography Conference, pages 635–658. Springer, 2016.

  • [7] Transaction Processing Performance Council. TPC-H benchmark specification. http://www.tpc.org/tpch/, 2008.

  • [8] Damien Desfontaines and Balázs Pejó. Sok: Differential privacies. arXiv preprint arXiv:1906.01337, 2019.

  • [9] Zeyu Ding, Yuxin Wang, Guanhong Wang, Danfeng Zhang, and Daniel Kifer. Detecting violations of differential privacy. In Proceedings of the 2018 ACM SIGSAC Conference on Computer and Communications Security, CCS ’18, pages 475–489, New York, NY, USA, 2018. ACM.

  • [10] Cynthia Dwork. An ad omnia approach to defining and achieving private data analysis. In International Workshop on Privacy, Security, and Trust in KDD, pages 1–13. Springer, 2007.

  • [11] Cynthia Dwork. The differential privacy frontier. In Theory of Cryptography Conference, pages 496–502. Springer, 2009.

  • [12] Cynthia Dwork, Frank McSherry, Kobbi Nissim, and Adam Smith. Calibrating noise to sensitivity in private data analysis. In Theory of Cryptography Conference, pages 265–284. Springer, 2006.

  • [13] Cynthia Dwork, Moni Naor, Toniann Pitassi, Guy N Roth-blum, and Sergey Yekhanin. Pan-private streaming algorithms. In ICS, pages 66–80, 2010.

  • [14] Cynthia Dwork and Aaron Roth. The algorithmic foundations of differential privacy. Foundations and Trends in Theoretical Computer Science, 9(3–4):211–407, 2014.

  • [15] Paul Francis, Sebastian Probst Eide, and Reinhard Munz. Diffix: High-utility database anonymization. In Annual Privacy Forum, pages 141–158. Springer, 2017.

  • [16] Quan Geng and Pramod Viswanath. The optimal mechanism in differential privacy. arXiv preprint arXiv:1212.1186, 2012.

  • [17] Michaela Gotz, Ashwin Machanavajjhala, Guozhang Wang, Xiaokui Xiao, and Johannes Gehrke. Publishing search logs—a comparative study of privacy guarantees. IEEE Transactions on Knowledge and Data Engineering, 24(3):520–532, 2011.

  • [18] J. H. Halton. Algorithm 247: Radical-inverse quasi-random point sequence. Commun. ACM, 7(12):701–702, December 1964.

  • [19] Justin Hsu, Marco Gaboardi, Andreas Haeberlen, Sanjeev Khanna, Arjun Narayan, Benjamin C Pierce, and Aaron Roth. Differential privacy: An economic method for choosing epsilon. In 2014 IEEE 27th Computer Security Foundations Symposium, pages 398–410. IEEE, 2014.

  • [20] Noah Johnson and Joseph P Near. Dataflow analysis & differential privacy for SQL queries. https://github.com/uber/sql-differential-privacy. Accessed: 2019-09-04.

  • [21] Noah Johnson, Joseph P Near, and Dawn Song. Towards practical differential privacy for SQL queries. Proceedings of the VLDB Endowment, 11(5):526–539, 2018.

  • [22] Peter Kairouz, Sewoong Oh, and Pramod Viswanath. The composition theorem for differential privacy. IEEE Transactions on Information Theory, 63(6):4037–4049, 2017.

  • [23] Richard M Karp and Robert Kleinberg. Noisy binary search and its applications. In Proceedings of the eighteenth annual ACM-SIAM symposium on Discrete algorithms, pages 881–890. Society for Industrial and Applied Mathematics, 2007.

  • [24] Daniel Kifer and Ashwin Machanavajjhala. No free lunch in data privacy. In Proceedings of the 2011 ACM SIGMOD International Conference on Management of data, pages 193–204. ACM, 2011.

  • [25] Aleksandra Korolova, Krishnaram Kenthapadi, Nina Mishra, and Alexandros Ntoulas. Releasing search queries and clicks privately. In Proceedings of the 18th international conference on World wide web, pages 171–180. ACM, 2009.

  • [26] Ios Kotsogiannis, Yuchao Tao, Xi He, Maryam Fanaeepour, Ashwin Machanavajjhala, Michael Hay, and Gerome Miklau. Privatesql: a differentially private sql query engine. Proceedings of the VLDB Endowment, 12(11):1371–1384, 2019.

  • [27] Ios Kotsogiannis, Yuchao Tao, Ashwin Machanavajjhala, Gerome Miklau, and Michael Hay. Architecting a differentially private SQL engine. In Conference on Innovative Data Systems Research, 2019.

  • [28] Sara Krehbiel. Choosing epsilon for privacy as a service. Proceedings on Privacy Enhancing Technologies, 2019(1):192–205, 2019.

  • [29] Jaewoo Lee and Chris Clifton. How much is enough? choosing ɛ for differential privacy. In International Conference on Information Security, pages 325–340. Springer, 2011.

  • [30] Chao Li, Michael Hay, Gerome Miklau, and Yue Wang. A data-and workload-aware algorithm for range queries under differential privacy. Proceedings of the VLDB Endowment, 7(5):341–352, 2014.

  • [31] Ninghui Li, Min Lyu, Dong Su, and Weining Yang. Differential privacy: From theory to practice. Synthesis Lectures on Information Security, Privacy, & Trust, 8(4):1–138, 2016.

  • [32] Frank D McSherry. Synthethic data via differential privacy. https://github.com/frankmcsherry/blog/blob/master/assets/Synth-SIGMOD.pdf. Accessed: 2019-05-28.

  • [33] Frank D McSherry. Uber’s differential privacy.. probably isn’t. https://github.com/frankmcsherry/blog/blob/master/posts/2018-02-25.md. Accessed: 2019-03-22.

  • [34] Frank D McSherry. Privacy integrated queries: an extensible platform for privacy-preserving data analysis. In Proceedings of the 2009 ACM SIGMOD International Conference on Management of data, pages 19–30. ACM, 2009.

  • [35] Sebastian Meiser and Esfandiar Mohammadi. Tight on budget?: Tight bounds for r-fold approximate differential privacy. In Proceedings of the 2018 ACM SIGSAC Conference on Computer and Communications Security, pages 247–264. ACM, 2018.

  • [36] Ilya Mironov. On significance of the least significant bits for differential privacy. In Proceedings of the 2012 ACM conference on Computer and communications security, pages 650–661. ACM, 2012.

  • [37] Ilya Mironov. Rényi differential privacy. In 2017 IEEE 30th Computer Security Foundations Symposium (CSF), pages 263–275. IEEE, 2017.

  • [38] Maurizio Naldi and Giuseppe D’Acquisto. Differential privacy: an estimation theory-based method for choosing epsilon. arXiv preprint arXiv:1510.00917, 2015.

  • [39] Arjun Narayan and Andreas Haeberlen. DJoin: differentially private join queries over distributed databases. In Presented as part of the 10th {USENIX} Symposium on Operating Systems Design and Implementation ({OSDI} 12), pages 149–162, 2012.

  • [40] Kobbi Nissim, Sofya Raskhodnikova, and Adam Smith. Smooth sensitivity and sampling in private data analysis. In Proceedings of the thirty-ninth annual ACM symposium on Theory of computing, pages 75–84. ACM, 2007.

  • [41] Kobbi Nissim, Thomas Steinke, Alexandra Wood, Micah Altman, Aaron Bembenek, Mark Bun, Marco Gaboardi, David R O’Brien, and Salil Vadhan. Differential privacy: A primer for a non-technical audience. In Privacy Law Scholars Conf, 2017.

  • [42] Larry Wasserman. All of statistics: a concise course in statistical inference. Springer Science & Business Media, 2013.

OPEN ACCESS

Journal + Issues

Search