Swiss Statistics Seminar, Spring 2017
Creating Public-Use Synthetic Data From Complex Surveys
Matthias Templ (Zürich University of Applied Sciences)
The production of synthetic datasets has been proposed as a statistical disclosure control solution for generating public-use files from confidential data. Synthetic data can also serve as "augmented datasets" that provide input for microsimulation models and, more generally, can be used for design-based simulation studies. The performance and acceptability of such a tool rely heavily on the quality of the synthetic data, i.e., on the statistical similarity between the synthetic and the true population of interest. Multiple approaches and tools have been developed to generate synthetic data. These approaches can be categorized into three main groups: synthetic reconstruction, combinatorial optimization, and model-based generation. In addition, methods have been formulated to evaluate the quality of synthetic data. In this presentation, the methods are not presented from a theoretical point of view; rather, they are introduced in an applied and generally understandable fashion. We focus on new concepts for the model-based generation of synthetic data that avoids disclosure problems. At the end of the presentation, we introduce simPop, an open-source data synthesizer. simPop is a user-friendly R package based on a modular object-oriented concept. It provides a highly optimized S4 class implementation of various methods, including calibration by iterative proportional fitting/updating and simulated annealing, and modeling or data fusion by logistic regression, regression tree methods, and many other methods. Utility functions to deal with (age) heaping are implemented as well. An example is shown using real data from Official Statistics. The simulated data can then serve as input for agent-based simulation and/or microsimulation, or can be used as open data for research and teaching.
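To make the calibration step named in the abstract more concrete, here is a minimal sketch of iterative proportional fitting on a two-way table, written in plain Python. It is not simPop's implementation or API; the function name `ipf`, its parameters, and the toy numbers are assumptions chosen purely for illustration. The idea is to alternately rescale rows and columns of a sample table until its margins match known population totals.

```python
def ipf(table, row_targets, col_targets, max_iter=100, tol=1e-10):
    """Rescale a 2-D table of positive weights so its margins match the targets.

    Hypothetical helper for illustration only; assumes the row targets and
    column targets sum to the same grand total.
    """
    fitted = [row[:] for row in table]  # work on a copy
    for _ in range(max_iter):
        # Row step: scale each row so that it sums to its target total.
        for i, target in enumerate(row_targets):
            total = sum(fitted[i])
            if total > 0:
                fitted[i] = [value * target / total for value in fitted[i]]
        # Column step: scale each column so that it sums to its target total.
        for j, target in enumerate(col_targets):
            total = sum(row[j] for row in fitted)
            if total > 0:
                for row in fitted:
                    row[j] *= target / total
        # Columns now match; stop once the rows still match as well.
        row_error = max(
            abs(sum(fitted[i]) - target) for i, target in enumerate(row_targets)
        )
        if row_error < tol:
            break
    return fitted


# Toy sample counts (made-up numbers) and assumed known population margins.
sample = [[10.0, 20.0], [30.0, 40.0]]
fitted = ipf(sample, row_targets=[40.0, 60.0], col_targets=[55.0, 45.0])
```

The fitted table keeps the association structure of the sample (its odds ratios) while reproducing the specified margins, which is why this kind of procedure is commonly used to calibrate survey or synthetic data to known population totals.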