A Simple Method for Limiting Disclosure in Continuous Microdata Based on Principal Component Analysis

Aida Calviño 1, 2
  • 1 Department of Computer Science and Mathematics, Universitat Rovira i Virgili, 43007 Tarragona, Spain
  • 2 Department of Statistics and Operations Research III, Complutense University of Madrid, 28040 Madrid, Spain


In this article we propose a simple and versatile method for limiting disclosure in continuous microdata based on Principal Component Analysis (PCA). Instead of perturbing the original variables, we propose to alter the principal components: they contain the same information but are uncorrelated, so each component can be processed separately, which reduces processing time. The number and weight of the perturbed components determine the level of protection and distortion of the masked data. The method preserves the mean vector and the variance-covariance matrix. Furthermore, depending on the technique chosen to perturb the principal components, the proposed method can produce masked, hybrid, or fully synthetic data sets. Examples of application and comparisons with other methods previously proposed in the literature (in terms of disclosure risk and data utility) are also included.
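The idea of perturbing principal components rather than the original variables can be illustrated with a minimal NumPy sketch. The abstract does not specify how the components are perturbed, so the choice below is an assumption made only for illustration: the selected component scores are replaced by random draws that are orthogonalized against the retained scores and rescaled to the original eigenvalues. The function name `pca_mask` and the parameter `perturb_idx` are hypothetical and do not come from the article.

```python
import numpy as np

def pca_mask(X, perturb_idx, rng):
    """Replace selected principal-component scores with synthetic ones
    while keeping the sample mean vector and covariance matrix unchanged.
    Illustrative sketch only; not the article's exact procedure."""
    X = np.asarray(X, dtype=float)
    n, p = X.shape
    mean = X.mean(axis=0)
    Xc = X - mean

    # PCA via eigendecomposition of the sample covariance matrix.
    eigvals, eigvecs = np.linalg.eigh(np.cov(Xc, rowvar=False))
    order = np.argsort(eigvals)[::-1]            # largest variance first
    eigvals = np.clip(eigvals[order], 0.0, None)
    eigvecs = eigvecs[:, order]
    scores = Xc @ eigvecs                        # uncorrelated components

    perturb_idx = list(perturb_idx)
    keep_idx = [j for j in range(p) if j not in perturb_idx]

    # Random draws, made orthogonal to the constant vector and to the
    # retained scores so that means and cross-covariances stay at zero.
    B = np.column_stack([np.ones(n), scores[:, keep_idx]])
    Z = rng.standard_normal((n, len(perturb_idx)))
    coef, *_ = np.linalg.lstsq(B, Z, rcond=None)
    Q, _ = np.linalg.qr(Z - B @ coef)            # orthonormal columns

    # Rescale so each new component has the original eigenvalue as variance.
    new_scores = scores.copy()
    new_scores[:, perturb_idx] = Q * np.sqrt((n - 1) * eigvals[perturb_idx])

    # Map the (partly synthetic) components back to the original variables.
    return mean + new_scores @ eigvecs.T

# Usage on made-up data: perturb the two components with the least weight.
rng = np.random.default_rng(0)
X = rng.standard_normal((200, 4)) @ rng.standard_normal((4, 4))
masked = pca_mask(X, perturb_idx=[2, 3], rng=rng)
```

In this sketch, perturbing only some components corresponds loosely to a hybrid data set, while passing every index in `perturb_idx` replaces all components and yields fully synthetic values. It assumes more records than variables, so that the orthogonalization step has enough degrees of freedom.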

