Estimating Classification Errors Under Edit Restrictions in Composite Survey-Register Data Using Multiple Imputation Latent Class Modelling (MILC)

Open access

Abstract

Both registers and surveys can contain classification errors, and these errors can be estimated by making use of a composite data set. We propose a new method based on latent class modelling that estimates the number of classification errors across several sources while taking into account edit restrictions, that is, impossible combinations with scores on other variables. By multiply imputing a new variable, the latent class model also enhances the quality of statistics based on the composite data set. A simulation study investigates the performance of the method and shows that its applicability depends on the entropy R² of the latent class model and on the type of analysis a researcher is planning to do. Finally, the method is applied to public data from Statistics Netherlands.
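The abstract names the entropy R² of the latent class model as the quantity that determines whether the method can be applied, but does not define it. As a minimal sketch, assuming the common definition as the proportional reduction in classification entropy (one minus the average entropy of the posterior class-membership probabilities divided by the entropy of the marginal class proportions), it could be computed as follows. The function names and the use of averaged posteriors as marginal class proportions are illustrative assumptions, not taken from the article.

```python
import math


def entropy(probs):
    """Shannon entropy (natural log) of one probability vector; 0*log(0) is treated as 0."""
    return -sum(p * math.log(p) for p in probs if p > 0)


def entropy_r2(posteriors):
    """Entropy-based R^2 from an N x C list of posterior class-membership probabilities.

    Assumed definition: 1 - (mean posterior entropy) / (entropy of marginal class proportions).
    """
    n = len(posteriors)
    n_classes = len(posteriors[0])
    # Assumption: marginal class proportions are estimated as the average posterior per class.
    marginal = [sum(row[c] for row in posteriors) / n for c in range(n_classes)]
    total = entropy(marginal)
    if total == 0:
        raise ValueError("All mass is on one class; entropy R^2 is undefined.")
    conditional = sum(entropy(row) for row in posteriors) / n
    return 1 - conditional / total


# Hypothetical posteriors for four units and two latent classes.
example = [[0.9, 0.1], [0.8, 0.2], [0.2, 0.8], [0.1, 0.9]]
print(round(entropy_r2(example), 3))
```

Under this definition a value near 1 means the indicators separate the latent classes almost perfectly, while a value near 0 means the posteriors are hardly more informative than the marginal class proportions.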


Journal of Official Statistics

The Journal of Statistics Sweden
