Population Size Estimation and Linkage Errors: the Multiple Lists Case

Open access

Abstract

Data integration is now common practice in official statistics and involves an increasing number of sources. When multiple sources are used, one objective is to estimate the unknown size of the population, and capture-recapture methods are applied to this end. Standard capture-recapture methods rest on a number of strong assumptions, including the absence of errors in the integration procedures. This assumption is unlikely to hold, particularly when the integrated sources were not originally collected for statistical purposes, and linkage errors (false links and missing links) may occur. This article tackles the problem of adjusting population size estimates in the presence of linkage errors across multiple lists. Under the assumption of homogeneous linkage error probabilities, a solution is proposed for a realistic and practical multiple-list linkage scenario.
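To illustrate why linkage errors matter, the following minimal sketch uses the classical two-list Petersen estimator (the simplest capture-recapture case, not the multiple-list method proposed in the article). The list sizes, overlap, and error counts are hypothetical numbers chosen only for illustration, and the function name `petersen_estimate` is an assumption of this sketch.

```python
def petersen_estimate(n1: int, n2: int, matches: int) -> float:
    """Two-list Petersen estimate of population size: N_hat = n1 * n2 / m."""
    if matches <= 0:
        raise ValueError("at least one linked record is required")
    return n1 * n2 / matches


# Hypothetical list sizes and true overlap (illustrative values only).
n1, n2, true_matches = 800, 600, 480

# Estimate when every true match is linked and no false link occurs.
baseline = petersen_estimate(n1, n2, true_matches)

# Missing links: assume 48 of the 480 true matches are not linked.
with_missing_links = petersen_estimate(n1, n2, true_matches - 48)

# False links: assume 24 non-matching pairs are wrongly linked.
with_false_links = petersen_estimate(n1, n2, true_matches + 24)

print(f"no linkage errors: {baseline:.1f}")
print(f"missing links:     {with_missing_links:.1f}")
print(f"false links:       {with_false_links:.1f}")
```

In this sketch, missing links shrink the observed overlap and push the estimate above the error-free value, while false links inflate the overlap and push it below. An uncorrected estimator is therefore biased in a direction that depends on which error type dominates.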

Journal of Official Statistics

The Journal of Statistics Sweden
