Cardinality Estimators do not Preserve Privacy

Damien Desfontaines 1 , Andreas Lochbihler 2 ,  and David Basin 3
  • 1 ETH Zurich,
  • 2
  • 3 ETH Zurich,


Cardinality estimators like HyperLogLog are sketching algorithms that estimate the number of distinct elements in a large multiset. Their use in privacy-sensitive contexts raises the question of whether they leak private information. In particular, can they provide any privacy guarantees while preserving their strong aggregation properties?

We formulate an abstract notion of cardinality estimators, that captures this aggregation requirement: one can merge sketches without losing precision. We propose an attacker model and a corresponding privacy definition, strictly weaker than differential privacy: we assume that the attacker has no prior knowledge of the data. We then show that if a cardinality estimator satisfies this definition, then it cannot have a reasonable level of accuracy. We prove similar results for weaker versions of our definition, a nd a nalyze t he p rivacy o f existing algorithms, showing that their average privacy loss is significant, e ven f or m ultisets w ith l arge cardinalities. We conclude that strong aggregation requirements are incompatible with any reasonable definition o f privacy, and that cardinality estimators should be considered as sensitive as raw data. We also propose risk mitigation strategies for their real-world applications.

If the inline PDF is not rendering correctly, you can download the PDF file here.

  • [1] [n. d.]. Approximate algorithms in Apache Spark: HyperLogLog and Quantiles. Accessed: 2018-05-03.

  • [2] [n. d.]. BigQuery Technical Documentation on Functions & Operators. Accessed: 2017-10-24.

  • [3] [n. d.]. HyperLogLog implementation for mssql. Accessed: 2018-05-03.

  • [4] [n. d.]. PostgreSQL extension extension adding HyperLogLog data structures as a native data type. Accessed: 2018-05-03.

  • [5] Vikas G Ashok and Ravi Mukkamala. 2014. A scalable and efficient privacy preserving global itemset support approximation using Bloom filters. In IFIP Annual Conference on Data and Applications Security and Privacy. Springer, 382–389.

  • [6] Ziv Bar-Yossef, TS Jayram, Ravi Kumar, D Sivakumar, and Luca Trevisan. 2002. Counting distinct elements in a data stream. In International Workshop on Randomization and Approximation Techniques in Computer Science. Springer, 1–10.

  • [7] Raef Bassily, Adam Groce, Jonathan Katz, and Adam Smith. 2013. Coupled-worlds privacy: Exploiting adversarial uncertainty in statistical data privacy. In Foundations of Computer Science (FOCS), 2013 IEEE 54th Annual Symposium on. IEEE, 439–448.

  • [8] Raef Bassily and Adam Smith. 2015. Local, private, efficient protocols for succinct histograms. In Proceedings of the forty-seventh annual ACM symposium on Theory of computing. ACM, 127–135.

  • [9] Raef Bassily, Uri Stemmer, Abhradeep Guha Thakurta, et al. 2017. Practical locally private heavy hitters. In Advances in Neural Information Processing Systems. 2288–2296.

  • [10] Kevin Beyer, Peter J Haas, Berthold Reinwald, Yannis Sismanis, and Rainer Gemulla. 2007. On synopses for distinct-value estimation under multiset operations. In Proceedings of the 2007 ACM SIGMOD international conference on Management of data. ACM, 199–210.

  • [11] Raghav Bhaskar, Abhishek Bhowmick, Vipul Goyal, Srivatsan Laxman, and Abhradeep Thakurta. 2011. Noiseless database privacy. In International Conference on the Theory and Application of Cryptology and Information Security. Springer, 215–232.

  • [12] George Casella and Roger L Berger. 2002. Statistical inference. Vol. 2. Duxbury Pacific Grove, CA.

  • [13] Jeffrey Dean and Sanjay Ghemawat. 2008. MapReduce: simplified data processing on large clusters. Commun. ACM 51, 1 (2008), 107–113.

  • [14] Claudia Diaz, Stefaan Seys, Joris Claessens, and Bart Preneel. 2002. Towards measuring anonymity. In International Workshop on Privacy Enhancing Technologies. Springer, 54–68.

  • [15] Yitao Duan. 2009. Privacy without noise. In Proceedings of the 18th ACM conference on Information and knowledge management. ACM, 1517–1520.

  • [16] Marianne Durand and Philippe Flajolet. 2003. Loglog counting of large cardinalities. In European Symposium on Algorithms. Springer, 605–617.

  • [17] Cynthia Dwork. 2008. Differential privacy: A survey of results. In International Conference on Theory and Applications of Models of Computation. Springer, 1–19.

  • [18] Cynthia Dwork, Frank McSherry, Kobbi Nissim, and Adam Smith. 2006. Calibrating noise to sensitivity in private data analysis. In Theory of Cryptography Conference. Springer, 265–284.

  • [19] Cynthia Dwork, Guy N Rothblum, and Salil Vadhan. 2010. Boosting and differential privacy. In Foundations of Computer Science (FOCS), 2010 51st Annual IEEE Symposium on. IEEE, 51–60.

  • [20] Rolf Egert, Marc Fischlin, David Gens, Sven Jacob, Matthias Senker, and Jörn Tillmanns. 2015. Privately computing set-union and set-intersection cardinality via Bloom filters. In Australasian Conference on Information Security and Privacy. Springer, 413–430.

  • [21] Úlfar Erlingsson, Vasyl Pihur, and Aleksandra Korolova. 2014. RAPPOR: Randomized aggregatable privacy-preserving ordinal response. In Proceedings of the 2014 ACM SIGSAC conference on computer and communications security. ACM, 1054–1067.

  • [22] Cristian Estan, George Varghese, and Mike Fisk. 2003. Bitmap algorithms for counting active flows on high speed links. In Proceedings of the 3rd ACM SIGCOMM conference on Internet measurement. ACM, 153–166.

  • [23] Giulia Fanti, Vasyl Pihur, and Úlfar Erlingsson. 2016. Building a rappor with the unknown: Privacy-preserving learning of associations and data dictionaries. Proceedings on Privacy Enhancing Technologies 2016, 3 (2016), 41–61.

  • [24] Philippe Flajolet, Éric Fusy, Olivier Gandouet, and Frédéric Meunier. 2008. HyperLogLog: the analysis of a near-optimal cardinality estimation algorithm. DMTCS Proceedings 1 (2008).

  • [25] Philippe Flajolet and G Nigel Martin. 1985. Probabilistic counting algorithms for data base applications. Journal of computer and system sciences 31, 2 (1985), 182–209.

  • [26] Philippe Golle and Kurt Partridge. 2009. On the anonymity of home/work location pairs. Pervasive computing (2009), 390–397.

  • [27] Michaela Gotz, Ashwin Machanavajjhala, Guozhang Wang, Xiaokui Xiao, and Johannes Gehrke. 2012. Publishing search logs—a comparative study of privacy guarantees. IEEE Transactions on Knowledge and Data Engineering 24, 3 (2012), 520–532.

  • [28] Krzysztof Grining and Marek Klonowski. 2017. Towards Extending Noiseless Privacy: Dependent Data and More Practical Approach. In Proceedings of the 2017 ACM on Asia Conference on Computer and Communications Security. ACM, 546–560.

  • [29] Stefan Heule, Marc Nunkesser, and Alexander Hall. 2013. HyperLogLog in practice: algorithmic engineering of a state of the art cardinality estimation algorithm. In Proceedings of the 16th International Conference on Extending Database Technology. ACM, 683–692.

  • [30] Daniel Kifer and Ashwin Machanavajjhala. 2012. A rigorous and customizable framework for privacy. In Proceedings of the 31st ACM SIGMOD-SIGACT-SIGAI symposium on Principles of Database Systems. ACM, 77–88.

  • [31] Ashwin Machanavajjhala, Daniel Kifer, John Abowd, Johannes Gehrke, and Lars Vilhuber. 2008. Privacy: Theory meets practice on the map. In Proceedings of the 2008 IEEE 24th International Conference on Data Engineering. IEEE Computer Society, 277–286.

  • [32] Luca Melis, George Danezis, and Emiliano De Cristofaro. 2015. Efficient private statistics with succinct sketches. arXiv preprint arXiv:1508.06110 (2015).

  • [33] Ilya Mironov. 2017. Renyi differential privacy. In Computer Security Foundations Symposium (CSF), 2017 IEEE 30th. IEEE, 263–275.

  • [34] Anna Monreale, Wendy Hui Wang, Francesca Pratesi, Salvatore Rinzivillo, Dino Pedreschi, Gennady Andrienko, and Natalia Andrienko. 2013. Privacy-preserving distributed movement data aggregation. In Geographic Information Science at the Heart of Europe. Springer, 225–245.

  • [35] Sriram Padmanabhan, Bishwaranjan Bhattacharjee, Tim Malkemus, Leslie Cranston, and Matthew Huras. 2003. Multidimensional clustering: A new data layout scheme in db2. In Proceedings of the 2003 ACM SIGMOD international conference on Management of data. ACM, 637–641.

  • [36] Odysseas Papapetrou, Wolf Siberski, and Wolfgang Nejdl. 2010. Cardinality estimation and dynamic length adaptation for Bloom filters. Distributed and Parallel Databases 28, 2 (2010), 119–156.

  • [37] Vibhor Rastogi, Michael Hay, Gerome Miklau, and Dan Suciu. 2009. Relationship privacy: output perturbation for queries with joins. In Proceedings of the twenty-eighth ACM SIGMOD-SIGACT-SIGART symposium on Principles of database systems. ACM, 107–116.

  • [38] David Rebollo-Monedero and Jordi Forné. 2010. Optimized query forgery for private information retrieval. IEEE Transactions on Information Theory 56, 9 (2010), 4631–4642.

  • [39] David Rebollo-Monedero, Jordi Forne, and Josep Domingo-Ferrer. 2010. From t-closeness-like privacy to postrandomization via information theory. IEEE Transactions on Knowledge and Data Engineering 22, 11 (2010), 1623–1636.

  • [40] Amit Shukla, Prasad Deshpande, Jeffrey F Naughton, and Karthikeyan Ramasamy. 1996. Storage estimation for multidimensional aggregates in the presence of hierarchies. In VLDB, Vol. 96. 522–531.

  • [41] Florian Tschorsch and Björn Scheuermann. 2013. An algorithm for privacy-preserving distributed user statistics. Computer Networks 57, 14 (2013), 2775–2787.

  • [42] Kyu-Young Whang, Brad T Vander-Zanden, and Howard M Taylor. 1990. A linear-time probabilistic counting algorithm for database applications. ACM Transactions on Database Systems (TODS) 15, 2 (1990), 208–229.

  • [43] Shuheng Zhou, Katrina Ligett, and Larry Wasserman. 2009. Differential privacy with compression. In Information Theory, 2009. ISIT 2009. IEEE International Symposium on. IEEE, 2718–2722.


Journal + Issues