New data sources, namely big data and the Internet, have become an important issue in statistics and for official statistics in particular. However, before these sources can be used for statistics, it is necessary to conduct a thorough analysis of sources of nonrepresentativeness.
In this article, we focus on detecting correlates of the selection mechanism that underlies Internet data sources for the secondary real estate market in Poland and results in representation errors (frame and selection errors). To identify characteristics of properties offered online, we link data collected from the two largest advertisement services in Poland with the Register of Real Estate Prices and Values, which covers all transactions made in Poland. Quarterly data for 2016 were linked at a domain level defined by local administrative units (LAU1), the urban/rural distinction and usable floor area (UFA), categorized into four groups. To identify correlates of representation error, we used a generalized additive mixed model based on almost 5,500 domains, with quarter included in the domain definition.
Results indicate that properties not advertised online differ significantly from those advertised on the Internet in terms of UFA and location. A non-linear relationship with the average price per m² can be observed, which diminishes after accounting for LAU1 units.
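The domain-level linkage described above can be illustrated with a minimal sketch. It is not the authors' code: the record fields (`lau1`, `urban`, `ufa_m2`, `quarter`), the LAU1 codes, the toy records and, in particular, the UFA cutoffs are hypothetical, since the abstract does not state how the four UFA groups are defined. The sketch only shows how both sources could be counted per domain and joined on a common key.

```python
from collections import Counter

# Hypothetical UFA cutoffs in square metres; the abstract only says
# that UFA is categorized into four groups, not where the cuts are.
UFA_CUTOFFS = (40.0, 60.0, 80.0)


def ufa_group(ufa_m2):
    """Map usable floor area to one of four groups, numbered 0 to 3."""
    return sum(ufa_m2 >= cutoff for cutoff in UFA_CUTOFFS)


def domain_key(rec):
    """Domain = (LAU1 code, urban/rural flag, UFA group, quarter)."""
    return (rec["lau1"], rec["urban"], ufa_group(rec["ufa_m2"]), rec["quarter"])


def link_domains(adverts, transactions):
    """Count records per domain in each source and join on the domain key."""
    n_ads = Counter(domain_key(r) for r in adverts)
    n_tx = Counter(domain_key(r) for r in transactions)
    linked = {}
    for key in sorted(set(n_ads) | set(n_tx)):
        linked[key] = {
            "adverts": n_ads.get(key, 0),
            "transactions": n_tx.get(key, 0),
        }
    return linked


# Toy records, invented for illustration only.
adverts = [
    {"lau1": "3064", "urban": True, "ufa_m2": 38.0, "quarter": "2016Q1"},
    {"lau1": "3064", "urban": True, "ufa_m2": 35.5, "quarter": "2016Q1"},
    {"lau1": "3064", "urban": True, "ufa_m2": 72.0, "quarter": "2016Q1"},
]
transactions = [
    {"lau1": "3064", "urban": True, "ufa_m2": 39.0, "quarter": "2016Q1"},
    {"lau1": "3021", "urban": False, "ufa_m2": 95.0, "quarter": "2016Q1"},
]

linked = link_domains(adverts, transactions)
for key, counts in linked.items():
    print(key, counts)
```

Domains that appear in only one source are kept with a zero count for the other, because a domain with transactions but no adverts is exactly the kind of case a representation-error analysis would need to see.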