Linking disparate data sets across time, space and sources is probably the foremost issue currently facing Central Statistical Agencies (CSAs). A review of the current literature on the challenges facing CSAs shows three issues standing out: 1) using administrative data effectively; 2) big data and what it means for CSAs; and 3) integrating disparate data sets (such as health, education and wealth) to provide measurable facts that can guide policy makers. CSAs are being asked to tackle the same kinds of problems faced by Google, Facebook and Yahoo, which use graphical/semantic web models for organizing, searching and analysing data. Additionally, time and space (geography) are becoming more important dimensions (domains) for CSAs as they start to explore new data sources and ways to integrate them to study relationships. Central agency methodologists are being pushed to incorporate these new perspectives into their standard theories, practices and policies.
Like most methodologists, the authors see surveys and the publication of their results as a process in which estimation is the key tool for achieving the final goal of an accurate statistical output. Randomness and sampling exist to support this goal, and early on it was clear to us that the incoming “it-is-what-it-is” data sources were not randomly selected. Being non-random, these sources were biased and would thus produce biased estimates. So, we set out to design a strategy to deal with this issue.
This article presents a schema for integrating and linking traditional and non-traditional data sets. Like all survey methodologies, this schema addresses the fundamental issues of representativeness, estimation and total survey error measurement.