Practical Privacy-Preserving K-means Clustering

Payman Mohassel 1, Mike Rosulek 2, and Ni Trieu 3
  • 1 Facebook. Work done partially while at Visa Research.
  • 2 Oregon State University. Partially supported by NSF award 1617197, a Google faculty award, and a Visa faculty award.
  • 3 University of California, Berkeley. Work done partially while at Oregon State University and Visa Research.

Abstract

Clustering is a common technique for data analysis that aims to partition data into groups of similar items. When the data comes from different sources, it is highly desirable to maintain the privacy of each database. In this work, we study a popular clustering algorithm (K-means) and adapt it to the privacy-preserving context.

Specifically, to construct our privacy-preserving clustering algorithm, we first propose an efficient batched protocol for computing squared Euclidean distances in the amortized setting, where one needs to compute the distance from the same point to many other points. Furthermore, we construct a customized garbled circuit for computing the minimum value among shared values. We believe these new constructions may be of independent interest. We implement and evaluate our protocols to demonstrate their practicality and show that they can train on much larger datasets, and do so faster, than previous work. The numerical results also show that the proposed protocol achieves almost the same accuracy as a plaintext K-means clustering algorithm.
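
For orientation, the following is a minimal plaintext sketch of one K-means iteration. It is not the privacy-preserving protocol of this work. It only illustrates the two operations the abstract singles out: the squared Euclidean distance from one point to every centroid, and the selection of the minimum among those distances. All function names and the toy data are assumptions made for illustration.

```python
from typing import List, Sequence, Tuple

Point = Sequence[float]


def squared_distance(p: Point, q: Point) -> float:
    """Squared Euclidean distance; the square root is not needed for comparisons."""
    return sum((a - b) ** 2 for a, b in zip(p, q))


def nearest_centroid(p: Point, centroids: Sequence[Point]) -> int:
    """Index of the centroid with the minimum squared distance to p."""
    distances = [squared_distance(p, c) for c in centroids]
    return min(range(len(centroids)), key=distances.__getitem__)


def kmeans_step(
    points: Sequence[Point], centroids: Sequence[Point]
) -> Tuple[List[int], List[List[float]]]:
    """One iteration: assign every point to a cluster, then average each cluster."""
    labels = [nearest_centroid(p, centroids) for p in points]
    new_centroids: List[List[float]] = []
    for k, old in enumerate(centroids):
        members = [p for p, label in zip(points, labels) if label == k]
        if not members:
            # Hypothetical policy for this sketch: an empty cluster keeps its centroid.
            new_centroids.append(list(old))
            continue
        new_centroids.append(
            [sum(m[d] for m in members) / len(members) for d in range(len(old))]
        )
    return labels, new_centroids


# Toy data (illustrative only): two well-separated groups in the plane.
points = [(0.0, 0.0), (0.0, 2.0), (10.0, 10.0), (10.0, 12.0)]
centroids = [(1.0, 1.0), (9.0, 9.0)]
labels, centroids = kmeans_step(points, centroids)
print(labels, centroids)
```

In a privacy-preserving setting, the inputs to `squared_distance` and the list passed to the minimum step would be held as shares between the parties rather than in the clear. The sketch above only fixes the plaintext functionality that such a protocol is meant to reproduce.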



