On Proxy Variables and Categorical Data Fusion

Open access


The problem of inference about the joint distribution of two categorical variables based on knowledge or observations of their marginal distributions, to be referred to as categorical data fusion in this paper, is relevant in statistical matching, ecological inference, market research, and several other related fields. This article organizes the use of proxy variables, to be distinguished from other auxiliary variables, both in terms of their effects on the uncertainty of fusion and the techniques of fusion. A measure of the gains in efficiency is provided, which incorporates both the identification uncertainty associated with data fusion and the sampling uncertainty that arises when the theoretical bounds of the uncertainty space are unknown and need to be estimated. Several existing techniques for generating fusion distributions (or datasets) are described, and some new ones are proposed. Analysis of real-life data demonstrates empirically that proxy variables can make data fusion more precise and the constructed fusion distribution more plausible.
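To make the notion of identification uncertainty concrete, the sketch below computes the classical Fréchet bounds on each cell of a joint table when only the two marginal distributions are known, and then the narrower bounds obtained when an additional variable Z is observed jointly with each of the two target variables. This is a generic textbook illustration, not the efficiency measure or the fusion techniques of the article. The function names and all probability values are hypothetical, chosen only for the example.

```python
import numpy as np


def frechet_bounds(p_x, p_y):
    """Cell-wise Frechet bounds for a joint table with margins p_x and p_y."""
    p_x = np.asarray(p_x, dtype=float)
    p_y = np.asarray(p_y, dtype=float)
    # Lower bound: max(0, P(x) + P(y) - 1); upper bound: min(P(x), P(y)).
    lower = np.maximum(0.0, p_x[:, None] + p_y[None, :] - 1.0)
    upper = np.minimum(p_x[:, None], p_y[None, :])
    return lower, upper


def conditional_frechet_bounds(p_z, p_x_given_z, p_y_given_z):
    """Bounds on P(x, y) obtained by averaging Frechet bounds within levels of Z."""
    lower = 0.0
    upper = 0.0
    for weight, p_x, p_y in zip(p_z, p_x_given_z, p_y_given_z):
        lo_z, up_z = frechet_bounds(p_x, p_y)
        lower = lower + weight * lo_z
        upper = upper + weight * up_z
    return lower, upper


# Hypothetical inputs: Z has two levels, X and Y have two categories each.
p_z = np.array([0.5, 0.5])
p_x_given_z = np.array([[0.9, 0.1], [0.2, 0.8]])  # rows: levels of Z
p_y_given_z = np.array([[0.8, 0.2], [0.3, 0.7]])  # rows: levels of Z

# Margins of X and Y implied by the conditional distributions.
p_x = p_z @ p_x_given_z
p_y = p_z @ p_y_given_z

lo, up = frechet_bounds(p_x, p_y)
clo, cup = conditional_frechet_bounds(p_z, p_x_given_z, p_y_given_z)

print("width without Z:\n", up - lo)
print("width with Z:\n", cup - clo)
```

In this toy setting the interval for every cell is no wider with Z than without it, which is the sense in which additional variables can reduce the uncertainty of fusion. In practice the bounds must be estimated from samples, which adds the sampling uncertainty mentioned above.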


Journal of Official Statistics

The Journal of Statistics Sweden
