Secure and Scalable Document Similarity on Distributed Databases: Differential Privacy to the Rescue

Phillipp Schoppmann 1 , Lennart Vogelsang 2 , Adrià Gascón 3 , and Borja Balle 4
  • 1 Humboldt-Universität zu Berlin, Alexander von Humboldt Institute for Internet and Society, Berlin, Germany
  • 2 Humboldt-Universität zu Berlin, Alexander von Humboldt Institute for Internet and Society, Berlin, Germany
  • 3 Alan Turing Institute, , London
  • 4 Amazon Research,


Privacy-preserving collaborative data analysis enables richer models than what each party can learn with their own data. Secure Multi-Party Computation (MPC) offers a robust cryptographic approach to this problem, and in fact several protocols have been proposed for various data analysis and machine learning tasks. In this work, we focus on secure similarity computation between text documents, and the application to k-nearest neighbors (k-NN) classification. Due to its non-parametric nature, k-NN presents scalability challenges in the MPC setting. Previous work addresses these by introducing non-standard assumptions about the abilities of an attacker, for example by relying on non-colluding servers. In this work, we tackle the scalability challenge from a different angle, and instead introduce a secure preprocessing phase that reveals differentially private (DP) statistics about the data. This allows us to exploit the inherent sparsity of text data and significantly speed up all subsequent classifications.

If the inline PDF is not rendering correctly, you can download the PDF file here.

  • [1] Aspell dictionary creation.

  • [2] Amazon Product Data.

  • [3] Multi-protocol SPDZ., 2018.

  • [4] 1000 Genomes Project Consortium et al. A global reference for human genetic variation. Nature, 526(7571):68, 2015.

  • [5] M. Abadi, A. Agarwal, P. Barham, E. Brevdo, Z. Chen, C. Citro, G. S. Corrado, A. Davis, J. Dean, M. Devin, S. Ghemawat, I. Goodfellow, A. Harp, G. Irving, M. Isard, Y. Jia, R. Jozefowicz, L. Kaiser, M. Kudlur, J. Levenberg, D. Mané, R. Monga, S. Moore, D. Murray, C. Olah, M. Schuster, J. Shlens, B. Steiner, I. Sutskever, K. Talwar, P. Tucker, V. Vanhoucke, V. Vasudevan, F. Viégas, O. Vinyals, P. Warden, M. Wattenberg, M. Wicke, Y. Yu, and X. Zheng. TensorFlow: Large-scale machine learning on heterogeneous systems, 2015. URL available from

  • [6] M. Al-Rubaie, P. Y. Wu, J. M. Chang, and S. Kung. Privacy-preserving PCA on horizontally-partitioned data. In DSC, pages 280–287, 2017.

  • [7] M. Bafna and J. Ullman. The price of selection in differential privacy. In COLT, pages 151–168. PMLR, 2017.

  • [8] K. E. Batcher. Sorting Networks and Their Applications. In AFIPS Spring Joint Computing Conference, volume 32, pages 307–314, 1968.

  • [9] D. Beaver. Efficient Multiparty Protocols Using Circuit Randomization. In CRYPTO, pages 420–432, 1991.

  • [10] J. Beel, B. Gipp, S. Langer, and C. Breitinger. Paper recommender systems: a literature survey. International Journal on Digital Libraries, 17(4):305–338, 2016.

  • [11] A. Beimel, K. Nissim, and E. Omri. Distributed private data analysis: Simultaneously solving how and what. In CRYPTO, pages 451–468. Springer, 2008.

  • [12] A. Ben-Efraim, Y. Lindell, and E. Omri. Optimizing Semi-Honest Secure Multiparty Computation for the Internet. In ACM Conference on Computer and Communications Security, pages 578–590, 2016.

  • [13] C. Blundo, E. D. Cristofaro, and P. Gasti. EsPRESSo: Effi-cient privacy-preserving evaluation of sample set similarity. In DPM/SETOP, pages 89–103, 2012.

  • [14] R. Bost, R. A. Popa, S. Tu, and S. Goldwasser. Machine learning classification over encrypted data. In NDSS, 2015.

  • [15] S. Buyrukbilen and S. Bakiras. Secure similar document detection with simhash. In Secure Data Management, pages 61–75, 2013.

  • [16] R. Canetti. Security and composition of multiparty cryptographic protocols. J. Cryptology, 13(1):143–202, 2000.

  • [17] K. Chaudhuri and S. Dasgupta. Rates of convergence for nearest neighbor classification. In Advances in Neural Information Processing Systems, pages 3437–3445, 2014.

  • [18] K. Chaudhuri and S. A. Vinterbo. A stability-based validation procedure for differentially private machine learning. In NIPS, pages 2652–2660, 2013.

  • [19] H. Chen, I. Chillotti, Y. Dong, O. Poburinnaya, I. P. Razenshteyn, and M. S. Riazi. SANNS: scaling up secure approximate k-nearest neighbors search. IACR Cryptology ePrint Archive, 2019:359, 2019.

  • [20] I. Damgård, V. Pastro, N. P. Smart, and S. Zakarias. Multi-party computation from somewhat homomorphic encryption. In CRYPTO, volume 7417, 2012.

  • [21] J. Doerner. The Absentminded Crypto Kit. URL

  • [22] J. Doerner and A. Shelat. Scaling ORAM for Secure Computation. In CCS, pages 523–535, 2017.

  • [23] C. Dwork and A. Roth. The Algorithmic Foundations of Differential Privacy. Foundations and Trends in Theoretical Computer Science, 9(3–4):211–407, Aug. 2014.

  • [24] C. Dwork, K. Kenthapadi, F. McSherry, I. Mironov, and M. Naor. Our Data, Ourselves: Privacy Via Distributed Noise Generation. In EUROCRYPT, pages 486–503, 2006.

  • [25] C. Dwork, F. McSherry, K. Nissim, and A. D. Smith. Calibrating Noise to Sensitivity in Private Data Analysis. In TCC, pages 265–284, 2006.

  • [26] A. Efros. How to stop worrying and learn to love nearest neighbors. NIPS workshop on Nearest Neighbors for Modern Applications with Massive Data, 2017.

  • [27] M. Fanaeepour and B. I. P. Rubinstein. Histogramming privately ever after: Differentially-private data-dependent error bound optimisation. In ICDE, pages 1204–1207. IEEE Computer Society, 2018.

  • [28] A. Gascón, P. Schoppmann, B. Balle, M. Raykova, J. Doerner, S. Zahur, and D. Evans. Privacy-Preserving Distributed Linear Regression on High-Dimensional Data. Proceedings on Privacy Enhancing Technologies, 2017(4):345–364, Oct. 2017.

  • [29] N. Gilboa. Two party RSA key generation. In CRYPTO, pages 116–129. Springer, 1999.

  • [30] O. Goldreich. The Foundations of Cryptography – Volume 2, Basic Applications. 2004.

  • [31] A. Groce, P. Rindal, and M. Rosulek. Cheaper private set intersection via differentially private leakage. PoPETs, 2019 (3):6–25, 2019.

  • [32] G. Guennebaud, B. Jacob, et al. Eigen v3., 2010.

  • [33] R. He and J. J. McAuley. Ups and downs: Modeling the visual evolution of fashion trends with one-class collaborative filtering. In WWW, pages 507–517. ACM, 2016.

  • [34] X. He, A. Machanavajjhala, C. J. Flynn, and D. Srivastava. Composing differential privacy and secure computation: A case study on scaling private record linkage. In ACM Conference on Computer and Communications Security, pages 1389–1406, 2017.

  • [35] Y. Huang, D. Evans, and J. Katz. Private set intersection: Are garbled circuits better than custom protocols? In NDSS, 2012.

  • [36] W. Jiang, M. Murugesan, C. Clifton, and L. Si. Similar document detection with limited information disclosure. In ICDE, pages 735–743, 2008.

  • [37] C. Juvekar, V. Vaikuntanathan, and A. Chandrakasan. Gazelle: A Low Latency Framework for Secure Neural Network Inference. IACR Cryptology ePrint Archive, 2018:73, 2018.

  • [38] D. D. Lewis, Y. Yang, T. G. Rose, and F. Li. Rcv1: A new benchmark collection for text categorization research. Journal of machine learning research, 5(Apr):361–397, 2004.

  • [39] F. Li, R. Shin, and V. Paxson. Exploring privacy preservation in outsourced k-nearest neighbors with multiple data owners. In CCSW, pages 53–64, 2015.

  • [40] Y. Lindell. How to simulate it - A tutorial on the simulation proof technique. In Tutorials on the Foundations of Cryptography, pages 277–346. Springer International Publishing, 2017.

  • [41] Y. Lindell and B. Pinkas. A Proof of Security of Yao’s Protocol for Two-Party Computation. J. Cryptology, 22(2): 161–188, 2009.

  • [42] J. Liu, M. Juuti, Y. Lu, and N. Asokan. Oblivious neural network predictions via minionn transformations. In CCS, pages 619–631, 2017.

  • [43] C. Manning, P. Raghavan, and H. Schütze. Scoring, term weighting and the vector space model. In Introduction to information retrieval, pages 100–122. 2008.

  • [44] S. Mazloom and S. D. Gordon. Secure computation with differentially private access patterns. In ACM Conference on Computer and Communications Security, pages 490–507, 2018.

  • [45] F. McSherry and K. Talwar. Mechanism design via differential privacy. In FOCS, pages 94–103, 2007.

  • [46] I. Mironov, O. Pandey, O. Reingold, and S. P. Vadhan. Computational differential privacy. In CRYPTO, pages 126–142, 2009.

  • [47] P. Mohassel and Y. Zhang. SecureML: A system for scalable privacy-preserving machine learning. In IEEE Symposium on Security and Privacy, pages 19–38, 2017.

  • [48] M. Murugesan, W. Jiang, C. Clifton, L. Si, and J. Vaidya. Efficient privacy-preserving similar document detection. VLDB J., 19(4):457–475, 2010.

  • [49] K. Nayak, X. S. Wang, S. Ioannidis, U. Weinsberg, N. Taft, and E. Shi. GraphSC: Parallel secure computation made easy. In IEEE Symposium on Security and Privacy, pages 377–394. IEEE Computer Society, 2015.

  • [50] V. Nikolaenko, S. Ioannidis, U. Weinsberg, M. Joye, N. Taft, and D. Boneh. Privacy-preserving matrix factorization. In ACM Conference on Computer and Communications Security, pages 801–812. ACM, 2013.

  • [51] V. Nikolaenko, U. Weinsberg, S. Ioannidis, M. Joye, D. Boneh, and N. Taft. Privacy-preserving ridge regression on hundreds of millions of records. In IEEE Symposium on Security and Privacy, pages 334–348, 2013.

  • [52] F. Pedregosa, G. Varoquaux, A. Gramfort, V. Michel, B. Thirion, O. Grisel, M. Blondel, P. Prettenhofer, R. Weiss, V. Dubourg, J. VanderPlas, A. Passos, D. Cournapeau, M. Brucher, M. Perrot, and E. Duchesnay. Scikit-learn: Machine Learning in Python. Journal of Machine Learning Research, 12:2825–2830, 2011.

  • [53] D. M. W. Powers. Applications and explanations of zipf’s law. In CoNLL, pages 151–160, 1998.

  • [54] M. S. Riazi, B. Chen, A. Shrivastava, D. S. Wallach, and F. Koushanfar. Sub-linear privacy-preserving search with untrusted server and semi-honest parties. CoRR, abs/1612.01835, 2016.

  • [55] H. Rong, H. Wang, J. Liu, and M. Xian. Privacy-preserving k-nearest neighbor computation in multiple cloud environments. IEEE Access, 4:9589–9603, 2016.

  • [56] A. Waksman. A permutation network. J. ACM, 15(1): 159–163, 1968.

  • [57] X. Wang, T.-H. H. Chan, and E. Shi. Circuit ORAM: On Tightness of the Goldreich-Ostrovsky Lower Bound. In ACM Conference on Computer and Communications Security, pages 850–861, 2015.

  • [58] X. Wang, A. J. Malozemoff, and J. Katz. EMP-toolkit: Efficient MultiParty computation toolkit., 2016.

  • [59] A. C.-C. Yao. How to Generate and Exchange Secrets (Extended Abstract). In FOCS, pages 162–167, 1986.

  • [60] S. Zahur and D. Evans. Obliv-C: A Language for Extensible Data-Oblivious Computation. IACR Cryptology ePrint Archive, 2015:1153, 2015.

  • [61] S. Zahur, X. S. Wang, M. Raykova, A. Gascón, J. Doerner, D. Evans, and J. Katz. Revisiting Square-Root ORAM: Efficient Random Access in Multi-party Computation. In IEEE Symposium on Security and Privacy, pages 218–234, 2016.


Journal + Issues