Search Results

Open access

Daniell Toth

Abstract

Statistical agencies often collect sensitive data for release to the public at aggregated levels in the form of tables. To protect confidential data, some cells are suppressed in the publicly released data. One problem with this method is that many cells of interest must be suppressed in order to protect a much smaller number of sensitive cells. Another problem is that the covariates used for aggregation and the level of aggregation must be fixed before the data is released. Both of these restrictions can severely limit the utility of the data. We propose a new disclosure limitation method that replaces the full set of microdata with synthetic data, which is then used to produce the released tables. This synthetic data set is obtained by replacing each unit’s values with a weighted average of sampled values from the surrounding area. The synthetic data is produced in a way that gives asymptotically unbiased estimates for aggregate cells as the number of units in the cell increases. The method is applied to the U.S. Bureau of Labor Statistics Quarterly Census of Employment and Wages data, which is released to the public quarterly in tabular form and aggregated across varying scales of time, area, and economic sector.
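The replacement step described in the abstract can be illustrated with a minimal sketch. This is not the author's procedure: the function name `synthesize`, the parameters `m` and `k`, the use of nearest neighbours as the "surrounding area", and the inverse-distance weights are all assumptions made for illustration, and the sketch makes no claim to the asymptotic unbiasedness property stated above. The input units are toy values, not QCEW data.

```python
import math
import random

def synthesize(units, m=5, k=3, seed=0):
    """Replace each unit's value with a weighted average of sampled nearby values.

    units: list of (x, y, value) tuples. All choices below (m nearest
    neighbours, k sampled, inverse-distance weights) are hypothetical.
    """
    rng = random.Random(seed)
    synthetic = []
    for (x0, y0, _) in units:
        # The m units closest to this one (the unit itself is at distance 0).
        nearest = sorted(units, key=lambda u: math.hypot(u[0] - x0, u[1] - y0))[:m]
        # Draw k of those neighbours at random, without replacement.
        sampled = rng.sample(nearest, min(k, len(nearest)))
        # Closer neighbours receive larger weights.
        weights = [1.0 / (1.0 + math.hypot(u[0] - x0, u[1] - y0)) for u in sampled]
        total = sum(weights)
        synthetic.append(sum(w * u[2] for w, u in zip(weights, sampled)) / total)
    return synthetic

# Toy units in two well-separated clusters (illustrative values only).
units = [(0, 0, 10.0), (1, 0, 12.0), (0, 1, 8.0),
         (5, 5, 100.0), (6, 5, 110.0), (5, 6, 90.0)]
synthetic = synthesize(units, m=3, k=2)
```

Because each synthetic value is a convex combination of nearby values, it stays within the range of its neighbourhood, while the exact value attached to any single unit is no longer released.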

Open access

Polly A. Phipps and Daniell Toth

Open access

Morgan Earp, Daniell Toth, Polly Phipps and Charlotte Oslund

Abstract

This article introduces and discusses a method for conducting an analysis of nonresponse for a longitudinal establishment survey using regression trees. The methodology consists of three parts: analysis during the frame refinement and enrollment phases, common in longitudinal surveys; analysis of the effect of time on response rates during data collection; and analysis of the potential for nonresponse bias. For all three analyses, regression tree models are used to identify establishment characteristics and subgroups of establishments that represent vulnerabilities during the data collection process. This information could be used to direct additional resources to collecting data from identified establishments in order to improve the response rate.
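As a rough illustration of how a regression tree can flag subgroups with low response, the sketch below finds a single split, the first step of growing a tree, on one establishment characteristic. It is a simplified stand-in, not the article's methodology: the function `best_split`, the field names `employees` and `responded`, and the records themselves are hypothetical.

```python
def best_split(records, feature):
    """Return (threshold, left_rate, right_rate) for the split on `feature`
    that minimises the within-group sum of squared errors of the 0/1
    response indicator, or None if the feature takes a single value."""
    def sse(values):
        mean = sum(values) / len(values)
        return sum((v - mean) ** 2 for v in values)

    thresholds = sorted({r[feature] for r in records})
    best = None
    # The largest value is skipped so the right-hand group is never empty.
    for t in thresholds[:-1]:
        left = [r["responded"] for r in records if r[feature] <= t]
        right = [r["responded"] for r in records if r[feature] > t]
        cost = sse(left) + sse(right)
        if best is None or cost < best[0]:
            best = (cost, t, sum(left) / len(left), sum(right) / len(right))
    return None if best is None else best[1:]

# Made-up establishments: size in employees and whether they responded.
records = [
    {"employees": 5, "responded": 1},
    {"employees": 8, "responded": 1},
    {"employees": 12, "responded": 1},
    {"employees": 200, "responded": 0},
    {"employees": 350, "responded": 0},
    {"employees": 500, "responded": 1},
]
threshold, left_rate, right_rate = best_split(records, "employees")
```

In this toy data the split separates small establishments, which all responded, from larger ones with a lower response rate. A full tree would apply the same search recursively within each group and across several characteristics.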