James O. Chipperfield
Large amounts of microdata are collected by data custodians in the form of censuses and administrative records. Often, data custodians will collect different information on the same individual, and many important questions can be answered by linking microdata collected by different custodians. For this reason, there is very strong demand for linked microdata from analysts in government, business, and universities. However, many data custodians are legally obliged to ensure that the risk of disclosing information about a person or organisation is acceptably low. Different authors have considered how to facilitate reliable statistical inference from analysis of linked microdata while keeping the risk of disclosure acceptably low. This article considers the problem from the perspective of an Integrating Authority that, by definition, is trusted to link the microdata and to facilitate analysts’ access to the linked microdata via a remote server, which allows analysts to fit models and view the statistical output without being able to observe the underlying linked microdata. One disclosure risk that must be managed by an Integrating Authority is that one data custodian may combine the microdata it supplied to the Integrating Authority with statistical output released from the remote server to disclose information, supplied by another custodian, about a person or organisation. This article restricts attention to the analysis of binary variables. The utility and disclosure risk of the proposed method are investigated both in a simulation and in a real example. The article shows that some popular protections against disclosure (dropping records, rounding regression coefficients, or imposing restrictions on model selection) can be ineffective in this setting.
James O. Chipperfield and Raymond L. Chambers
Record linkage is the act of bringing together records, from two or more files, that are believed to belong to the same unit (e.g., person or business). Record linkage is not an error-free process and can link a pair of records that do not belong to the same unit. This occurs because the linking fields on the files, which ideally would uniquely identify each unit, are often imperfect. There has been an explosion of record linkage applications, particularly involving government agencies and the field of health, yet there has been little work on making correct inferences from such linked files. Naively treating a linked file as if it were free of linkage errors can lead to biased inferences. This article develops a method of making inferences about cross-tabulated variables when record linkage is not an error-free process. In particular, it develops a parametric bootstrap approach to estimation that can accommodate the sophisticated probabilistic record linkage techniques widely used in practice (e.g., 1-1 linkage). The article demonstrates the effectiveness of this method in a simulation and in a real application.
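To illustrate the idea behind a parametric bootstrap correction for linkage error, the sketch below uses a deliberately simple exchangeable-error linkage model (each record links correctly with probability `lam`, otherwise to a random record) rather than the probabilistic 1-1 linkage the paper accommodates; all names and parameter values are hypothetical. A naive cross-tabulation of the linked file is bias-corrected by regenerating data from the naive estimate, re-linking it under the same error model, and subtracting the average induced distortion.

```python
import numpy as np

rng = np.random.default_rng(0)

def simulate_linkage(x, y, lam, rng):
    """Link y-records to x-records: correct with probability lam,
    otherwise linked to a uniformly random record (a toy
    exchangeable-error model, not the paper's linkage scheme)."""
    n = len(x)
    linked = y.copy()
    wrong = rng.random(n) > lam
    linked[wrong] = y[rng.integers(0, n, wrong.sum())]
    return np.stack([x, linked], axis=1)

def cross_tab(pairs):
    """2x2 table of proportions for binary (x, y) pairs."""
    t = np.zeros((2, 2))
    for a, b in pairs:
        t[a, b] += 1
    return t / len(pairs)

# hypothetical population: y mostly agrees with x
n, lam = 5000, 0.9
x = rng.integers(0, 2, n)
y = (x ^ (rng.random(n) < 0.2)).astype(int)

naive = cross_tab(simulate_linkage(x, y, lam, rng))

# parametric bootstrap: treat the naive table as the truth, regenerate
# and re-link bootstrap samples, and subtract the average distortion
B = 50
distortion = np.zeros((2, 2))
for _ in range(B):
    flat = naive.ravel()
    cells = rng.choice(4, size=n, p=flat / flat.sum())
    xb, yb = cells // 2, cells % 2
    distortion += cross_tab(simulate_linkage(xb, yb, lam, rng)) - naive
corrected = naive - distortion / B
```

Since each bootstrap table and the naive table both sum to one, the correction redistributes mass between cells without changing the total.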
James Chipperfield, James Brown and Philip Bell
In many countries, counts of people are a key factor in the allocation of government resources. However, it is well known that errors arise in Census counting of people (e.g., undercoverage due to people being missed). Therefore, it is common for national statistical agencies to conduct one or more “audit” surveys that are designed to estimate and remove systematic errors in Census counting. For example, the Australian Bureau of Statistics (ABS) conducts a single audit survey, called the Post Enumeration Survey (PES), shortly after each Australian Population Census. This article describes the estimator used by the ABS to estimate the count of people in Australia. Key features of this estimator are that it is unbiased in the presence of systematic measurement error in Census counting and under nonignorable nonresponse to the PES.
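As background for how an audit survey adjusts a Census count, the sketch below shows the textbook dual-system (Petersen) capture-recapture estimator. The ABS estimator described in the article is more elaborate (it handles systematic measurement error and nonignorable PES nonresponse), so this is only the core idea, and the input numbers are purely hypothetical.

```python
def dual_system_estimate(census_count, pes_count, matched_count):
    """Petersen / dual-system estimator of population size:
    N_hat = (census * pes) / matched, where `matched` is the number
    of PES people who could be matched to a Census record.
    This is the classical capture-recapture formula, not the full
    ABS estimator described in the article."""
    return census_count * pes_count / matched_count

# hypothetical figures: 24.0m counted in the Census; of 50,000 people
# enumerated in the PES, 48,000 were matched to a Census record
n_hat = dual_system_estimate(24_000_000, 50_000, 48_000)

# implied net undercount rate of the Census
undercount = 1 - 24_000_000 / n_hat
```

With these made-up figures the estimator inflates the Census count by the inverse of the PES match rate (50,000/48,000), giving an estimated population of 25 million and an implied 4% net undercount.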
James Chipperfield, John Newman, Gwenda Thompson, Yue Ma and Yan-Xia Lin
Many statistical agencies face the challenge of maintaining the confidentiality of respondents while providing as much analytical value as possible from their data. Datasets relating to businesses present particular difficulties because they are likely to contain information about large enterprises that dominate their industries and so may be more easily identified. Agencies therefore tend to take a cautious approach to releasing business data (e.g., trusted access, remote access and synthetic data). The Australian Bureau of Statistics has developed a remote server, called TableBuilder, which allows users to specify and request tables created from business microdata. The tables are confidentialised automatically by perturbing cell values, and the results are returned quickly to the users. The perturbation method is designed to protect against attacks, that is, attempts to undo the confidentialisation, such as the well-known differencing attack. This paper considers the risk and utility trade-off when releasing three Australian Bureau of Statistics business collections via its TableBuilder product.
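In a differencing attack, an analyst requests a count for a group and for the same group minus one unit; with exact counts, the difference reveals whether that unit is in scope. A common defence, sketched below, is to add a perturbation that is a deterministic function of the *set* of contributing records, so repeated queries over the same records always return the same value and cannot be averaged away. This is only an illustration of the "same cell, same perturbation" idea; the key, adjustment range, and hashing scheme are assumptions, not the actual ABS TableBuilder algorithm.

```python
import hashlib

def perturbed_count(record_ids, key=b"secret", max_adj=2):
    """Return the cell count plus a pseudo-random adjustment in
    [-max_adj, max_adj] derived deterministically from the set of
    contributing record IDs, so identical queries get identical
    answers (illustrative only, not the ABS method)."""
    h = hashlib.sha256(key + b",".join(sorted(record_ids))).digest()
    adj = h[0] % (2 * max_adj + 1) - max_adj
    return len(record_ids) + adj

# differencing attack: query a group with and without one target record
group = [b"r%d" % i for i in range(20)]
with_target = perturbed_count(group)
without_target = perturbed_count(group[:-1])
# with_target - without_target mixes two independent perturbations,
# so it no longer reliably reveals the target's presence
```

Because the adjustment depends on the record set rather than on fresh randomness per query, resubmitting the same table many times yields no extra information, which is what defeats averaging and differencing strategies.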