An Evolutionary Schema for Using “it-is-what-it-is” Data in Official Statistics

Open access

Abstract

The linking of disparate data sets across time, space and sources is probably the foremost current issue facing Central Statistical Agencies (CSAs). If one reviews the current literature looking for the prevalent challenges facing CSAs, three issues stand out: 1) using administrative data effectively; 2) big data and what it means for CSAs; and 3) integrating disparate data sets (such as health, education and wealth) to provide measurable facts that can guide policy makers. CSAs are being pushed to confront the same kinds of challenges faced by Google, Facebook, and Yahoo, which use graphical/semantic web models for organizing, searching and analysing data. Additionally, time and space (geography) are becoming more important dimensions (domains) for CSAs as they start to explore new data sources and ways to integrate them to study relationships. Central agency methodologists are being pushed to incorporate these new perspectives into their standard theories, practices and policies. Like most methodologists, the authors see surveys and the publication of their results as a process in which estimation is the key tool for achieving the final goal of an accurate statistical output. Randomness and sampling exist to support this goal, and early on it was clear to us that the incoming “it-is-what-it-is” data sources were not randomly selected. These sources were obviously biased and thus would produce biased estimates. So, we set out to design a strategy to deal with this issue.

This article presents a schema for integrating and linking traditional and non-traditional datasets. Like all survey methodologies, this schema addresses the fundamental issues of representativeness, estimation and total survey error measurement.


Journal of Official Statistics

The Journal of Statistics Sweden
