Cardinality Estimators do not Preserve Privacy

Open access

Abstract

Cardinality estimators like HyperLogLog are sketching algorithms that estimate the number of distinct elements in a large multiset. Their use in privacy-sensitive contexts raises the question of whether they leak private information. In particular, can they provide any privacy guarantees while preserving their strong aggregation properties?

We formulate an abstract notion of cardinality estimators, that captures this aggregation requirement: one can merge sketches without losing precision. We propose an attacker model and a corresponding privacy definition, strictly weaker than differential privacy: we assume that the attacker has no prior knowledge of the data. We then show that if a cardinality estimator satisfies this definition, then it cannot have a reasonable level of accuracy. We prove similar results for weaker versions of our definition, a nd a nalyze t he p rivacy o f existing algorithms, showing that their average privacy loss is significant, e ven f or m ultisets w ith l arge cardinalities. We conclude that strong aggregation requirements are incompatible with any reasonable definition o f privacy, and that cardinality estimators should be considered as sensitive as raw data. We also propose risk mitigation strategies for their real-world applications.

If the inline PDF is not rendering correctly, you can download the PDF file here.

  • [1] [n. d.]. Approximate algorithms in Apache Spark: HyperLogLog and Quantiles. https://databricks.com/blog/2016/05/19/approximate-algorithms-in-apache-spark-hyperloglog-and-quantiles.html. Accessed: 2018-05-03.

  • [2] [n. d.]. BigQuery Technical Documentation on Functions & Operators. https://cloud.google.com/bigquery/docs/reference/standardsql/functions-and-operators. Accessed: 2017-10-24.

  • [3] [n. d.]. HyperLogLog implementation for mssql. https://github.com/shuvava/mssql-hll. Accessed: 2018-05-03.

  • [4] [n. d.]. PostgreSQL extension extension adding HyperLogLog data structures as a native data type. https://github.com/citusdata/postgresql-hll. Accessed: 2018-05-03.

  • [5] Vikas G Ashok and Ravi Mukkamala. 2014. A scalable and efficient privacy preserving global itemset support approximation using Bloom filters. In IFIP Annual Conference on Data and Applications Security and Privacy. Springer 382–389.

  • [6] Ziv Bar-Yossef TS Jayram Ravi Kumar D Sivakumar and Luca Trevisan. 2002. Counting distinct elements in a data stream. In International Workshop on Randomization and Approximation Techniques in Computer Science. Springer 1–10.

  • [7] Raef Bassily Adam Groce Jonathan Katz and Adam Smith. 2013. Coupled-worlds privacy: Exploiting adversarial uncertainty in statistical data privacy. In Foundations of Computer Science (FOCS) 2013 IEEE 54th Annual Symposium on. IEEE 439–448.

  • [8] Raef Bassily and Adam Smith. 2015. Local private efficient protocols for succinct histograms. In Proceedings of the forty-seventh annual ACM symposium on Theory of computing. ACM 127–135.

  • [9] Raef Bassily Uri Stemmer Abhradeep Guha Thakurta et al. 2017. Practical locally private heavy hitters. In Advances in Neural Information Processing Systems. 2288–2296.

  • [10] Kevin Beyer Peter J Haas Berthold Reinwald Yannis Sismanis and Rainer Gemulla. 2007. On synopses for distinct-value estimation under multiset operations. In Proceedings of the 2007 ACM SIGMOD international conference on Management of data. ACM 199–210.

  • [11] Raghav Bhaskar Abhishek Bhowmick Vipul Goyal Srivatsan Laxman and Abhradeep Thakurta. 2011. Noiseless database privacy. In International Conference on the Theory and Application of Cryptology and Information Security. Springer 215–232.

  • [12] George Casella and Roger L Berger. 2002. Statistical inference. Vol. 2. Duxbury Pacific Grove CA.

  • [13] Jeffrey Dean and Sanjay Ghemawat. 2008. MapReduce: simplified data processing on large clusters. Commun. ACM 51 1 (2008) 107–113.

  • [14] Claudia Diaz Stefaan Seys Joris Claessens and Bart Preneel. 2002. Towards measuring anonymity. In International Workshop on Privacy Enhancing Technologies. Springer 54–68.

  • [15] Yitao Duan. 2009. Privacy without noise. In Proceedings of the 18th ACM conference on Information and knowledge management. ACM 1517–1520.

  • [16] Marianne Durand and Philippe Flajolet. 2003. Loglog counting of large cardinalities. In European Symposium on Algorithms. Springer 605–617.

  • [17] Cynthia Dwork. 2008. Differential privacy: A survey of results. In International Conference on Theory and Applications of Models of Computation. Springer 1–19.

  • [18] Cynthia Dwork Frank McSherry Kobbi Nissim and Adam Smith. 2006. Calibrating noise to sensitivity in private data analysis. In Theory of Cryptography Conference. Springer 265–284.

  • [19] Cynthia Dwork Guy N Rothblum and Salil Vadhan. 2010. Boosting and differential privacy. In Foundations of Computer Science (FOCS) 2010 51st Annual IEEE Symposium on. IEEE 51–60.

  • [20] Rolf Egert Marc Fischlin David Gens Sven Jacob Matthias Senker and Jörn Tillmanns. 2015. Privately computing set-union and set-intersection cardinality via Bloom filters. In Australasian Conference on Information Security and Privacy. Springer 413–430.

  • [21] Úlfar Erlingsson Vasyl Pihur and Aleksandra Korolova. 2014. RAPPOR: Randomized aggregatable privacy-preserving ordinal response. In Proceedings of the 2014 ACM SIGSAC conference on computer and communications security. ACM 1054–1067.

  • [22] Cristian Estan George Varghese and Mike Fisk. 2003. Bitmap algorithms for counting active flows on high speed links. In Proceedings of the 3rd ACM SIGCOMM conference on Internet measurement. ACM 153–166.

  • [23] Giulia Fanti Vasyl Pihur and Úlfar Erlingsson. 2016. Building a rappor with the unknown: Privacy-preserving learning of associations and data dictionaries. Proceedings on Privacy Enhancing Technologies 2016 3 (2016) 41–61.

  • [24] Philippe Flajolet Éric Fusy Olivier Gandouet and Frédéric Meunier. 2008. HyperLogLog: the analysis of a near-optimal cardinality estimation algorithm. DMTCS Proceedings 1 (2008).

  • [25] Philippe Flajolet and G Nigel Martin. 1985. Probabilistic counting algorithms for data base applications. Journal of computer and system sciences 31 2 (1985) 182–209.

  • [26] Philippe Golle and Kurt Partridge. 2009. On the anonymity of home/work location pairs. Pervasive computing (2009) 390–397.

  • [27] Michaela Gotz Ashwin Machanavajjhala Guozhang Wang Xiaokui Xiao and Johannes Gehrke. 2012. Publishing search logs—a comparative study of privacy guarantees. IEEE Transactions on Knowledge and Data Engineering 24 3 (2012) 520–532.

  • [28] Krzysztof Grining and Marek Klonowski. 2017. Towards Extending Noiseless Privacy: Dependent Data and More Practical Approach. In Proceedings of the 2017 ACM on Asia Conference on Computer and Communications Security. ACM 546–560.

  • [29] Stefan Heule Marc Nunkesser and Alexander Hall. 2013. HyperLogLog in practice: algorithmic engineering of a state of the art cardinality estimation algorithm. In Proceedings of the 16th International Conference on Extending Database Technology. ACM 683–692.

  • [30] Daniel Kifer and Ashwin Machanavajjhala. 2012. A rigorous and customizable framework for privacy. In Proceedings of the 31st ACM SIGMOD-SIGACT-SIGAI symposium on Principles of Database Systems. ACM 77–88.

  • [31] Ashwin Machanavajjhala Daniel Kifer John Abowd Johannes Gehrke and Lars Vilhuber. 2008. Privacy: Theory meets practice on the map. In Proceedings of the 2008 IEEE 24th International Conference on Data Engineering. IEEE Computer Society 277–286.

  • [32] Luca Melis George Danezis and Emiliano De Cristofaro. 2015. Efficient private statistics with succinct sketches. arXiv preprint arXiv:1508.06110 (2015).

  • [33] Ilya Mironov. 2017. Renyi differential privacy. In Computer Security Foundations Symposium (CSF) 2017 IEEE 30th. IEEE 263–275.

  • [34] Anna Monreale Wendy Hui Wang Francesca Pratesi Salvatore Rinzivillo Dino Pedreschi Gennady Andrienko and Natalia Andrienko. 2013. Privacy-preserving distributed movement data aggregation. In Geographic Information Science at the Heart of Europe. Springer 225–245.

  • [35] Sriram Padmanabhan Bishwaranjan Bhattacharjee Tim Malkemus Leslie Cranston and Matthew Huras. 2003. Multidimensional clustering: A new data layout scheme in db2. In Proceedings of the 2003 ACM SIGMOD international conference on Management of data. ACM 637–641.

  • [36] Odysseas Papapetrou Wolf Siberski and Wolfgang Nejdl. 2010. Cardinality estimation and dynamic length adaptation for Bloom filters. Distributed and Parallel Databases 28 2 (2010) 119–156.

  • [37] Vibhor Rastogi Michael Hay Gerome Miklau and Dan Suciu. 2009. Relationship privacy: output perturbation for queries with joins. In Proceedings of the twenty-eighth ACM SIGMOD-SIGACT-SIGART symposium on Principles of database systems. ACM 107–116.

  • [38] David Rebollo-Monedero and Jordi Forné. 2010. Optimized query forgery for private information retrieval. IEEE Transactions on Information Theory 56 9 (2010) 4631–4642.

  • [39] David Rebollo-Monedero Jordi Forne and Josep Domingo-Ferrer. 2010. From t-closeness-like privacy to postrandomization via information theory. IEEE Transactions on Knowledge and Data Engineering 22 11 (2010) 1623–1636.

  • [40] Amit Shukla Prasad Deshpande Jeffrey F Naughton and Karthikeyan Ramasamy. 1996. Storage estimation for multidimensional aggregates in the presence of hierarchies. In VLDB Vol. 96. 522–531.

  • [41] Florian Tschorsch and Björn Scheuermann. 2013. An algorithm for privacy-preserving distributed user statistics. Computer Networks 57 14 (2013) 2775–2787.

  • [42] Kyu-Young Whang Brad T Vander-Zanden and Howard M Taylor. 1990. A linear-time probabilistic counting algorithm for database applications. ACM Transactions on Database Systems (TODS) 15 2 (1990) 208–229.

  • [43] Shuheng Zhou Katrina Ligett and Larry Wasserman. 2009. Differential privacy with compression. In Information Theory 2009. ISIT 2009. IEEE International Symposium on. IEEE 2718–2722.

Search
Journal information
Metrics
All Time Past Year Past 30 Days
Abstract Views 0 0 0
Full Text Views 232 232 45
PDF Downloads 148 148 15