New data sources, namely big data and the Internet, have become an important topic in statistics, and in official statistics in particular. Before these sources can be used to produce statistics, however, the sources of their nonrepresentativeness need to be analysed thoroughly.
In this article, we focus on detecting correlates of the selection mechanism that underlies Internet data sources for the secondary real estate market in Poland and that results in representation errors (frame and selection errors). To identify the characteristics of properties offered online, we link data collected from the two largest advertisement services in Poland with the Register of Real Estate Prices and Values, which covers all transactions made in Poland. Quarterly data for 2016 were linked at a domain level defined by local administrative units (LAU1), the urban/rural distinction and usable floor area (UFA), categorized into four groups. To identify correlates of representation error, we used a generalized additive mixed model fitted to almost 5,500 domains, counting each quarter separately.
The results indicate that properties not advertised online differ significantly from those advertised on the Internet in terms of UFA and location. A non-linear relationship with the average price per m² can be observed, which diminishes after accounting for LAU1 units.
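The domain-level linkage described above can be illustrated with a minimal Python sketch. It is not the authors' implementation: the record fields, the LAU1 codes, the toy records and the UFA cut points (40, 60 and 80 m²) are all assumptions made for illustration, since the abstract states only that UFA was categorized into four groups.

```python
from collections import Counter

# Hypothetical UFA cut points in m^2; the abstract only says "four groups".
UFA_BREAKS = (40.0, 60.0, 80.0)


def ufa_group(ufa_m2):
    """Return a group index 0..3 for a usable floor area given in m^2."""
    return sum(ufa_m2 > b for b in UFA_BREAKS)


def domain_key(rec):
    """Domain = LAU1 unit x urban/rural x UFA group x quarter."""
    return (rec["lau1"], rec["urban"], ufa_group(rec["ufa_m2"]), rec["quarter"])


def link_domains(register, adverts):
    """Count records per domain in each source and join on the domain key."""
    n_reg = Counter(domain_key(r) for r in register)
    n_adv = Counter(domain_key(a) for a in adverts)
    linked = {}
    for key in sorted(set(n_reg) | set(n_adv)):
        linked[key] = {"register": n_reg.get(key, 0), "adverts": n_adv.get(key, 0)}
    return linked


# Toy records with made-up LAU1 codes, for illustration only.
register = [
    {"lau1": "3064", "urban": True, "ufa_m2": 35.0, "quarter": 1},
    {"lau1": "3064", "urban": True, "ufa_m2": 38.5, "quarter": 1},
    {"lau1": "3021", "urban": False, "ufa_m2": 95.0, "quarter": 2},
]
adverts = [
    {"lau1": "3064", "urban": True, "ufa_m2": 39.0, "quarter": 1},
]

linked = link_domains(register, adverts)
for key, counts in linked.items():
    print(key, counts)
```

Domains that appear in the register but have no adverts are the ones of interest for representation error. Under these assumptions, the per-domain counts from the two sources would be the kind of input that a model of the selection mechanism is fitted to.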