The problem of inferring the joint distribution of two categorical variables from knowledge or observations of their marginal distributions, referred to in this paper as categorical data fusion, is relevant in statistical matching, ecological inference, market research, and several related fields. This article organizes the use of proxy variables, as distinguished from other auxiliary variables, in terms of both their effects on the uncertainty of fusion and the techniques of fusion. A measure of the efficiency gains is provided, which incorporates both the identification uncertainty associated with data fusion and the sampling uncertainty that arises when the theoretical bounds of the uncertainty space are unknown and need to be estimated. Several existing techniques for generating fusion distributions (or datasets) are described, and some new ones are proposed. Analysis of real-life data demonstrates empirically that proxy variables can make data fusion more precise and the constructed fusion distribution more plausible.