Toward Distribution Estimation under Local Differential Privacy with Small Samples

Open access

Abstract

A number of studies have recently been made on discrete distribution estimation in the local model, in which users obfuscate their personal data (e.g., location, response in a survey) by themselves and a data collector estimates a distribution of the original personal data from the obfuscated data. Unlike the centralized model, in which a trusted database administrator can access all users’ personal data, the local model does not suffer from the risk of data leakage. A representative privacy metric in this model is LDP (Local Differential Privacy), which controls the amount of information leakage by a parameter ∈ called privacy budget. When ∈ is small, a large amount of noise is added to the personal data, and therefore users’ privacy is strongly protected. However, when the number of users ℕ is small (e.g., a small-scale enterprise may not be able to collect large samples) or when most users adopt a small value of ∈, the estimation of the distribution becomes a very challenging task. The goal of this paper is to accurately estimate the distribution in the cases explained above. To achieve this goal, we focus on the EM (Expectation-Maximization) reconstruction method, which is a state-of-the-art statistical inference method, and propose a method to correct its estimation error (i.e., difference between the estimate and the true value) using the theory of Rilstone et al. We prove that the proposed method reduces the MSE (Mean Square Error) under some assumptions.We also evaluate the proposed method using three largescale datasets, two of which contain location data while the other contains census data. The results show that the proposed method significantly outperforms the EM reconstruction method in all of the datasets when ℕ or ∈ is small.

[1] Aggarwal CC, Yu PS (2008) Privacy-Preserving Data Mining. Springer

[2] Agrawal D, Aggarwal CC (2001) On the design and quantification of privacy preserving data mining algorithms. In: Proceedings of the 20th ACM SIGMOD-SIGACT-SIGART Symposium on Principles of Database Systems (PODS’01), pp 247-255

[3] Agrawal R, Srikant R, Thomas D (2005) Privacy preserving OLAP. In: Proceedings of the 2005 ACM SIGMOD international conference on Management of data (SIGMOD’05), pp 251-262

[4] Andrés ME, Bordenabe NE, Chatzikokolakis K, Palamidessi C (2013) Geo-indistinguishability: Differential privacy for location-based systems. In: Proceedings of the 20th ACM Conference on Computer and Communications Security (CCS’13), pp 901-914

[5] Bao Y, Ullah A (2007) The second-order bias and mean squared error of estimators in time-series models. Journal of Econometrics 140(2):650-669

[6] Bordenabe NE, Chatzikokolakis K, Palamidessi C (2014) Optimal geo-indistinguishable mechanisms for location privacy. In: Proceedings of the 21st ACM Conference on Computer and Communications Security (CCS’14), pp 251-262

[7] Chatzikokolakis K, ElSalamouny E, Palamidessi C (2017) Practical mechanisms for location privacy. Proceedings on Privacy Enhancing Technologies (PoPETs) 2017(4):210-231

[8] Chelmis C, Kolte J, Prasanna VK (2015) Big data analytics for demand response: Clustering over space and time. In: Proceedings of 2015 IEEE International Conference on Big Data (BigData’15), pp 2223-2232

[9] Cover TM, Thomas JA (2006) Elements of Information Theory, Second Edition. Wiley-Interscience

10] Data Breaches Increase 40 Percent in 2016, Finds New Report from Identity Theft Resource Center and CyberScout (2017) http://www.idtheftcenter.org/2016databreaches.html

[11] Duchi JC, Jordan MI, Wainwright MJ (2013) Local privacy and statistical minimax rates. In: Proceedings of the IEEE 54th Annual Symposium on Foundations of Computer Science (FOCS’13), pp 429-438

[12] Dwork C (2006) Differential privacy. In: Proceedings of the 33rd international conference on Automata, Languages and Programming (ICALP’06), pp 1-12

[13] Dwork C, Roth A (2014) The Algorithmic Foundations of Differential Privacy. Now Publishers

[14] Eagle N, Pentland A, Lazer D (2009) Inferring friendship network structure by using mobile phone data. Proceedings of the National Academy of Sciences (PNAS) 106(36):15,274-15,278

[15] Efron B, Hastie T (2016) Computer Age Statistical Inference. Cambridge University

[16] Eltarjaman W, Dewri R, Thurimella R (2017) Location privacy for rank-based geo-query systems. Proceedings on Privacy Enhancing Technologies (PoPETs) 2017(4):19-38

[17] Úlfar Erlingsson, Pihur V, Korolova A (2014) RAPPOR: Randomized aggregatable privacy-preserving ordinal response. In: Proceedings of the 2014 ACM SIGSAC Conference on Computer and Communications Security (CCS’14), pp 1054-1067

[18] Freudiger J, Shokri R, Hubaux JP (2011) Evaluating the privacy risk of location-based services. In: Proceedings of the 15th international conference on Financial Cryptography and Data Security (FC’11), pp 31-46

[19] Gelman A, Meng XL, Stern H (1996) Posterior predictive assessment of model fitness via realized discrepancies. Statistica Sinca 6:733-807

[20] Groetcsh C (1984) The Theory of Tikhonov Regularization for Fredholm Equations of the First Kind. Pitman Advanced Publishing Program

[21] Hasan O, Habegger B, Brunie L, Bennani N, Damiani E (2009) A discussion of privacy challenges in user profiling with big data techniques: The EEXCESS use case. In: Proceedings of 2013 IEEE International Congress on Big Data (BigData Congress’13), pp 25-30

[22] Hastie T, Tibshirani R, Friedman J (2009) The Elements of Statistical Learning. Spinger, 2nd edition

[23] Hino H, Shen H, Murata N, Wakao S, Hayashi Y (2013) A versatile clustering method for electricity consumption pattern analysis in households. IEEE Transactions on Smart Grid 4(2):1048-1057

[24] Hsu J, Gaboardi M, Haeberlen A, Khanna S, Narayan A, Pierce BC, Roth A (2014) Differential privacy: An economic method for choosing epsilon. In: Proceedings of the 2014 IEEE 27th Computer Security Foundations Symposium (CSF’14), pp 398-410

[25] Huang Z, Du W (2008) OptRR: Optimizing randomized response schemes for privacy-preserving data mining. In: Proceedings of IEEE 24th International Conference on Data Engineering (ICDE’08), pp 705-714

[26] Hull B, Bychkovsky V, Zhang Y, Chen K, Goraczko M (2006) CarTel: A distributed mobile sensor computing system. In: Proceedings of the 4th International Conference on Embedded Networked Sensor Systems (SenSys’06), pp 125-138

[27] Ishiguro M, Sakamoto Y, Kitagawa G (1997) Bootstrapping log likelihood and EIC, an extension of AIC. Annals of the Institute of Statistical Mathematics 49(3):411-434

[28] Johansen TA (1997) On Tikhonov regularization, bias and variance in nonlinear system identification. Automatica (3):441-446

[29] Jorgensen Z, Yu T, Cormode G (2015) Conservative or liberal? Personalized differential privacy. In: Proceedings of IEEE 31st International Conference on Data Engineering (ICDE’15), pp 1023-1034

[30] Kairouz P, Bonawitz K, Ramage D (2016) Discrete distribution estimation under local privacy. In: Proceedings of the 33rd International Conference on Machine Learning (ICML’16), pp 2436-2444

[31] Kairouz P, Oh S, Viswanath P (2016) Extremal mechanisms for local differential privacy. Journal of Machine Learning Research 17(1):492-542

[32] Li N, Lyu M, Su D (2016) Differential Privacy: From Theory to Practice. Morgan & Claypool Publishers

[33] Lichman M (2013) UCI machine learning repository. URL http://archive.ics.uci.edu/ml

[34] Lin J (1991) Divergence measures based on the shannon entropy. IEEE Transactions on Information Theory 37(1):145- 151

[35] Lisovich MA, Mulligan DK, Wicker SB (2010) Inferring personal information from demand-response systems. IEEE Security & Privacy 8(1):11-20

[36] Matsuo Y, Okazaki N, Izumi K, Nakamura Y, Nishimura T, Hasida K (2007) Inferring long-term user properties based on users’ location history. In: Proceedings of the 20th International Joint Conference on Artifical Intelligence (IJCAI’ 07), pp 2159-2165

[37] Pastore A, Gastpar M (2016) Locally differentially-private distribution estimation. In: Proceedings of the 2016 IEEE International Symposium on Information Theory (ISIT’16), pp 2694-2698

[38] Pyrgelis A, Troncoso C, Cristofaro ED (2017) What does the crowd say about you? Evaluating aggregation-based location privacy. Proceedings on Privacy Enhancing Technologies (PoPETs) 2017(4):76-96

[39] Qin Z, Yang Y, Yu T, Khalil I, Xiao X, Ren K (2016) Heavy hitter estimation over set-valued data with local differential privacy. In: Proceedings of the 2016 ACM SIGSAC Conference on Computer and Communications Security (CCS’16), pp 192-203

[40] Quercia D, Leontiadis I, McNamara L, Mascolo C, Crowcroft J (2011) Spotme if you can: Randomized responses for location obfuscation on mobile phones. In: Proceedings of the 2011 31st International Conference on Distributed Computing Systems (ICDCS’11), pp 363-372

[41] Rilstone P, Srivastava V, Ullah A (1996) The second-order bias and mean squared error of nonlinear estimators. Journal of Economics 75(2):369-395

[42] Schiaffino S, Amandi A (2009) Intelligent user profiling. In: Bramer M (ed) Artificial Intelligence, Springer-Verlag, pp 193-216

[43] Schubert E, Zimek A, Kriegel HP (2014) Generalized outlier detection with flexible kernel density estimates. In: Proceedings of the 14th SIAM International Conference on Data Mining (SDM’14), pp 542-550

[44] Sei Y, Ohusuga A (2017) Differential private data collection and analysis based on randomized multiple dummies for untrusted mobile crowdsensing. IEEE Transactions on Information Forensics and Security 12(4):926-939

[45] Sekimoto Y, Shibasaki R, Kanasugi H, Usui T, Shimazaki Y (2011) Pflow: Reconstructing people flow recycling large-scale social survey data. IEEE Pervasive Computing 10(4):27-35

[46] Shekhar S, Evans MR, Gunturi V, Yang K (2012) Spatial big-data challenges intersecting mobility and cloud computing. In: Proceedings of the 11th ACM International Workshop on Data Engineering for Wireless and Mobile Access (MobiDE’12), pp 1-12

[47] van der Vaart AW (1998) Asymptotic Statistics. Cambridge University Press

[48] Wang W, Carreira-Perpiñán MÁ (2013) Projection onto the probability simplex: An efficient algorithm with a simple proof, and an application. CoRR abs/1309.1541, URL http://arxiv.org/abs/1309.1541

[49] Warner SL (1965) Randomized response: A survey technique for eliminating evasive answer bias. Journal of the American Statistical Association 60(309):63-69

[50] Yang D, Zhang D, Qu B (2016) Participatory cultural mapping based on collective behavior data in location based social network. ACM Transactions on Intelligent Systems and Technology 7(3):30:1-30:23

[51] Zheng Y, Zhang L, Xie X, Ma WY (2009) Mining interesting locations and travel sequences from GPS trajectories. In: Proceedings of the 18th International Conference on World Wide Web (WWW’09), pp 791-800

Journal Information

Metrics

All Time Past Year Past 30 Days
Abstract Views 0 0 0
Full Text Views 421 421 61
PDF Downloads 203 203 37