Correlates of Representation Errors in Internet Data Sources for Real Estate Market

  • 1 Poznań University of Economics and Business, Department of Statistics, 61-875, Poznań, Poland


New data sources, namely big data and the Internet, have become an important issue in statistics, and in official statistics in particular. However, before these sources can be used to produce statistics, their sources of nonrepresentativeness must be analysed thoroughly.

In this article, we focus on detecting correlates of the selection mechanism that underlies Internet data sources for the secondary real estate market in Poland and that results in representation errors (frame and selection errors). To identify the characteristics of properties offered online, we link data collected from the two largest advertising services in Poland with the Register of Real Estate Prices and Values, which covers all transactions made in Poland. Quarterly data for 2016 were linked at a domain level defined by local administrative units (LAU1), the urban/rural distinction and usable floor area (UFA) categorized into four groups. To identify correlates of representation error, we used a generalized additive mixed model fitted to almost 5,500 domains, counting each quarter separately.
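The domain-level linkage described above can be sketched in a few lines of Python. This is a minimal illustration under stated assumptions, not the procedure used in the article: the UFA cut-offs, the record fields, the LAU1 codes and the online-to-register count ratio are all hypothetical, since the abstract does not specify the group boundaries or the outcome variable that was modelled.

```python
from collections import Counter

# Hypothetical UFA cut-offs in square metres; the abstract only says "four groups".
UFA_BOUNDS = (40.0, 60.0, 80.0)


def ufa_group(area_m2):
    """Return a group index 0..3 for a usable floor area given in m2."""
    return sum(area_m2 >= bound for bound in UFA_BOUNDS)


def domain_key(record):
    """A domain is LAU1 unit x urban/rural x UFA group x quarter."""
    return (record["lau1"], record["area_type"],
            ufa_group(record["ufa"]), record["quarter"])


def domain_counts(records):
    """Count records per domain."""
    return Counter(domain_key(r) for r in records)


def link_domains(register_records, online_records):
    """Join register and online counts on the domain key.

    The ratio online/register is an illustrative discrepancy measure only;
    it is None when the register has no transactions in the domain.
    """
    reg = domain_counts(register_records)
    web = domain_counts(online_records)
    linked = {}
    for key in sorted(set(reg) | set(web)):
        n_reg, n_web = reg.get(key, 0), web.get(key, 0)
        ratio = n_web / n_reg if n_reg else None
        linked[key] = {"register": n_reg, "online": n_web, "ratio": ratio}
    return linked


# Made-up records; the LAU1 codes and floor areas are not real data.
register = [
    {"lau1": "A01", "area_type": "urban", "ufa": 35.0, "quarter": "2016Q1"},
    {"lau1": "A01", "area_type": "urban", "ufa": 38.0, "quarter": "2016Q1"},
    {"lau1": "A01", "area_type": "urban", "ufa": 72.5, "quarter": "2016Q1"},
    {"lau1": "B02", "area_type": "rural", "ufa": 95.0, "quarter": "2016Q2"},
]
online = [
    {"lau1": "A01", "area_type": "urban", "ufa": 36.0, "quarter": "2016Q1"},
    {"lau1": "A01", "area_type": "urban", "ufa": 70.0, "quarter": "2016Q1"},
    {"lau1": "A01", "area_type": "urban", "ufa": 75.0, "quarter": "2016Q1"},
]

linked = link_domains(register, online)
for key, row in linked.items():
    print(key, row)
```

In this toy example the rural domain has a registered transaction but no online offer, which is the kind of domain-level discrepancy a model of representation error would then try to explain with covariates such as UFA group and location.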

The results indicate that properties not advertised online differ significantly from those advertised on the Internet in terms of UFA and location. A non-linear relationship with the average price per m² can be observed, which diminishes after accounting for LAU1 units.



