Building a RAPPOR with the Unknown: Privacy-Preserving Learning of Associations and Data Dictionaries

Open access

Abstract

Techniques based on randomized response enable the collection of potentially sensitive data from clients in a privacy-preserving manner with strong local differential privacy guarantees. A recent such technology, RAPPOR [12], enables estimation of the marginal frequencies of a set of strings via privacy-preserving crowdsourcing. However, this original estimation process relies on a known dictionary of possible strings; in practice, this dictionary can be extremely large and/or unknown. In this paper, we propose a novel decoding algorithm for the RAPPOR mechanism that enables the estimation of “unknown unknowns,” i.e., strings we do not know we should be estimating. To enable learning without explicit dictionary knowledge, we develop methodology for estimating the joint distribution of multiple variables collected with RAPPOR. Our contributions are not RAPPOR-specific, and can be generalized to other local differential privacy mechanisms for learning distributions of string-valued random variables.
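As a rough illustration of the randomized-response idea the abstract builds on, the sketch below implements the classic single-bit scheme in the spirit of Warner [27]. It is not the RAPPOR encoding or the decoding algorithm proposed in the paper; the function names and the parameter `p_truth` are hypothetical and chosen only for this example.

```python
import math
import random


def randomize_bit(true_bit: int, p_truth: float, rng: random.Random) -> int:
    """Report the true bit with probability p_truth, otherwise report its flip."""
    return true_bit if rng.random() < p_truth else 1 - true_bit


def estimate_frequency(reports, p_truth: float) -> float:
    """Estimate the fraction of clients whose true bit is 1.

    The expected observed rate is (1 - p_truth) + f * (2 * p_truth - 1),
    where f is the true fraction, so we invert that linear relation.
    Requires p_truth != 0.5.
    """
    observed = sum(reports) / len(reports)
    return (observed - (1.0 - p_truth)) / (2.0 * p_truth - 1.0)


def epsilon(p_truth: float) -> float:
    """Local differential privacy parameter of this binary scheme (p_truth > 0.5)."""
    return math.log(p_truth / (1.0 - p_truth))


if __name__ == "__main__":
    rng = random.Random(0)          # fixed seed so the run is repeatable
    true_fraction = 0.3             # assumed ground truth for the simulation
    p_truth = 0.75                  # assumed probability of answering truthfully
    true_bits = [1 if rng.random() < true_fraction else 0 for _ in range(100_000)]
    reports = [randomize_bit(b, p_truth, rng) for b in true_bits]
    print("estimated fraction:", estimate_frequency(reports, p_truth))
    print("epsilon:", epsilon(p_truth))
```

No individual report reveals a client's true bit with certainty, yet the aggregate frequency can still be recovered; a smaller `p_truth` (closer to 0.5) gives stronger privacy at the cost of a noisier estimate.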

[1] A. Agresti. Categorical Data Analysis. Wiley Series in Probability and Statistics. Wiley-Interscience, 2nd edition, 2002.

[2] Alexa. The top 500 sites on the web. http://www.alexa.com/topsites.

[3] Jane R Bambauer, Krish Muralidhar, and Rathindra Sarathy. Fool’s gold: an illustrated critique of differential privacy. 2013.

[4] Raef Bassily and Adam Smith. Local, private, efficient protocols for succinct histograms. In STOC. ACM, June 2015, to appear.

[5] T-H Hubert Chan, Mingfei Li, Elaine Shi, and Wenchang Xu. Differentially private continual monitoring of heavy hitters from distributed streams. In Privacy Enhancing Technologies, pages 140-159. Springer, 2012.

[6] Martin J Crowder. Maximum likelihood estimation for dependent observations. Journal of the Royal Statistical Society. Series B (Methodological), 1976.

[7] A. P. Dempster, N. M. Laird, and D. B. Rubin. Maximum likelihood from incomplete data via the EM algorithm. Journal of the Royal Statistical Society, Series B, 39(1):1-38, 1977.

[8] Arthur P Dempster, Nan M Laird, and Donald B Rubin. Maximum likelihood from incomplete data via the EM algorithm. Journal of the Royal Statistical Society. Series B (Methodological), pages 1-38, 1977.

[9] John C Duchi, Michael I Jordan, and Martin J Wainwright. Local privacy and statistical minimax rates. In FOCS, pages 429-438. IEEE, 2013.

[10] Cynthia Dwork. Differential privacy. In Automata, Languages and Programming, pages 1-12. Springer, 2006.

[11] Robert F Engle et al. Wald, likelihood ratio, and Lagrange multiplier tests in econometrics. Handbook of Econometrics, 1984.

[12] Úlfar Erlingsson, Vasyl Pihur, and Aleksandra Korolova. RAPPOR: Randomized aggregatable privacy-preserving ordinal response. ACM CCS, 2014.

[13] Stephen E Fienberg, Alessandro Rinaldo, and Xiaolin Yang. Differential privacy and the risk-utility tradeoff for multidimensional contingency tables. In Privacy in Statistical Databases, pages 187-199. Springer, 2011.

[14] Justin Hsu, Sanjeev Khanna, and Aaron Roth. Distributed private heavy hitters. In Automata, Languages, and Programming, pages 461-472. Springer, 2012.

[15] Google Inc. Google research blog: Learning Statistics with Privacy, aided by the Flip of a Coin. http://googleresearch.blogspot.com/2014/10/learning-statistics-with-privacyaided.html.

[16] Google Inc. Unwanted Software Policy. http://www.google.com/about/company/unwanted-software-policy.html.

[17] Prateek Jain, Pravesh Kothari, and Abhradeep Thakurta. Differentially private online learning. arXiv preprint arXiv:1109.0105, 2011.

[18] Peter Kairouz, Sewoong Oh, and Pramod Viswanath. Extremal mechanisms for local differential privacy. In NIPS, 2014.

[19] Shiva Prasad Kasiviswanathan and Adam Smith. A note on differential privacy: Defining resistance to arbitrary side information. CoRR abs/0803.3946, 2008.

[20] Shiva Prasad Kasiviswanathan, Homin K. Lee, Kobbi Nissim, Sofya Raskhodnikova, and Adam Smith. What can we learn privately? In FOCS, pages 531-540, Washington, DC, USA, 2008.

[21] Joe Kilian, André Madeira, Martin J Strauss, and Xuan Zheng. Fast private norm estimation and heavy hitters. In Theory of Cryptography, pages 176-193. Springer, 2008.

[22] Frank McSherry and Ilya Mironov. Differentially private recommender systems: building privacy into the net. In Proceedings of the 15th ACM SIGKDD international conference on Knowledge discovery and data mining, pages 627-636. ACM, 2009.

[23] M Mirghorbani and P Krokhmal. On finding k-cliques in k-partite graphs. Optimization Letters, 7(6):1155-1165, 2013.

[24] Vibhor Rastogi and Suman Nath. Differentially private aggregation of distributed time-series with transformation and encryption. In SIGMOD International Conference on Management of data, pages 735-746. ACM, 2010.

[25] Elaine Shi, T-H Hubert Chan, Eleanor G Rieffel, Richard Chow, and Dawn Song. Privacy-preserving aggregation of time-series data. In NDSS, volume 2, page 4, 2011.

[26] Aad W Van der Vaart. Asymptotic Statistics, volume 3. Cambridge University Press, 2000.

[27] Stanley Warner. Randomized response: A survey technique for eliminating evasive answer bias. JASA, 60(309):63-69, 1965.

[28] Oliver Williams and Frank McSherry. Probabilistic inference and differential privacy. In NIPS, pages 2451-2459, 2010.
