More and more data are being produced by an increasing number of electronic devices physically surrounding us and on the internet. The large amount of data and the high frequency at which they are produced have resulted in the introduction of the term ‘Big Data’. Because these data reflect many different aspects of our daily lives and because of their abundance and availability, Big Data sources are very interesting from an official statistics point of view. This article discusses the exploration of both opportunities and challenges for official statistics associated with the application of Big Data. Experiences gained with analyses of large amounts of Dutch traffic loop detection records and Dutch social media messages are described to illustrate the topics characteristic of the statistical analysis and use of Big Data.