Imprecise Imputation: A Nonparametric Micro Approach Reflecting the Natural Uncertainty of Statistical Matching with Categorical Data

Open access

Abstract

Statistical matching is the term for the integration of two or more data files that share a partially overlapping set of variables. Its aim is to obtain joint information on variables collected in different surveys based on different observation units. This naturally leads to an identification problem, since there is no observation that contains information on all variables of interest.

We develop the first statistical matching micro approach reflecting the natural uncertainty of statistical matching arising from the identification problem in the context of categorical data. A complete synthetic file is obtained by imprecise imputation, replacing missing entries by sets of suitable values. Altogether, we discuss three imprecise imputation strategies and propose ideas for potential refinements.
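The idea of replacing a missing entry by a set of suitable values can be illustrated with a minimal sketch. It is not one of the three strategies discussed in the paper, whose details the abstract does not give. It assumes that a value counts as suitable for a record if it was observed in the other file among units agreeing on all common variables. The function name `imprecise_impute`, the variable names `x`, `y`, `z`, and the toy records are hypothetical.

```python
from collections import defaultdict


def imprecise_impute(recipients, donors, common, target):
    """Attach to each recipient the set of `target` values observed among
    donors that agree with it on all `common` variables (illustrative rule)."""
    pools = defaultdict(set)  # common-variable profile -> observed target values
    domain = set()            # every target value observed in the donor file
    for d in donors:
        key = tuple(d[c] for c in common)
        pools[key].add(d[target])
        domain.add(d[target])

    completed = []
    for r in recipients:
        key = tuple(r[c] for c in common)
        # Assumption: with no matching donor, fall back to all observed values.
        values = pools.get(key) or domain
        completed.append({**r, target: frozenset(values)})
    return completed


# Hypothetical toy files: A observes (x, y), B observes (x, z).
file_a = [{"x": "young", "y": "low"}, {"x": "old", "y": "high"}]
file_b = [
    {"x": "young", "z": "rent"},
    {"x": "young", "z": "own"},
    {"x": "old", "z": "own"},
]
synthetic = imprecise_impute(file_a, file_b, common=["x"], target="z")
```

Each record of the resulting synthetic file carries a set for `z` rather than a single value, so the uncertainty about the unobserved joint behaviour of `y` and `z` stays visible in the data.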

Additionally, we show how the results of imprecise imputation can be embedded into the theory of finite random sets, providing tight lower and upper bounds for probability statements. The results, based on a newly developed simulation design customised to the specific requirements for assessing the quality of a statistical matching procedure for categorical data, corroborate that the narrowness of these bounds is practically relevant and that these bounds almost always cover the true parameters.
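How set-valued entries lead to lower and upper bounds can be sketched in the usual finite random set (Dempster-Shafer) manner. The sketch assumes each of the n set-valued observations is non-empty and carries weight 1/n. The lower bound counts the sets contained in the event, and the upper bound counts the sets that intersect it. The function name `probability_bounds` and the toy data are hypothetical, and the paper's own estimators may differ in detail.

```python
def probability_bounds(set_valued_obs, event):
    """Empirical lower and upper probability of `event` from set-valued data.

    Assumes every observation is a non-empty set with weight 1/n.
    """
    n = len(set_valued_obs)
    event = set(event)
    # Lower bound: observations that certainly lie in the event.
    lower = sum(1 for s in set_valued_obs if set(s) <= event) / n
    # Upper bound: observations that possibly lie in the event.
    upper = sum(1 for s in set_valued_obs if set(s) & event) / n
    return lower, upper


# Hypothetical imputed column: one of four entries is imprecise.
observations = [{"own"}, {"rent", "own"}, {"rent"}, {"own"}]
lower, upper = probability_bounds(observations, {"own"})
```

In this toy example the single imprecise entry is the only source of the gap between the two bounds. With precise data throughout, both would coincide with the ordinary relative frequency.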


