More and more data are being produced by a growing number of electronic devices, both those physically surrounding us and those on the internet. The large volume of these data and the high frequency at which they are produced have given rise to the term ‘Big Data’. Because these data reflect many different aspects of our daily lives, and because they are abundant and available, Big Data sources are of great interest from an official statistics point of view. This article explores both the opportunities and the challenges that the application of Big Data presents for official statistics. Experiences gained from analysing large amounts of Dutch traffic loop detection records and Dutch social media messages are described to illustrate the topics characteristic of the statistical analysis and use of Big Data.