End User Licence to Open Government Data? A Simulated Penetration Attack on Two Social Survey Datasets

Open access


In the UK, the transparency agenda is forcing data stewardship organisations to review their dissemination policies and to consider whether data that are currently available only under licence to a restricted community of researchers should be released as open data. Here we describe the results of a study that provides evidence about the risks of such an approach via a simulated attack on two social survey datasets. This is also the first systematic attempt to simulate a jigsaw identification attack (one using a mashup of multiple data sources) on an anonymised dataset. The information that we draw on is collected from multiple online data sources and purchasable commercial data. The results indicate that such an attack against anonymised end user licence (EUL) datasets would be possible if they were converted into open datasets. We therefore recommend that penetration tests be factored into any decision to make datasets about people open.

Journal of Official Statistics

The Journal of Statistics Sweden
