Synthetic data disclosure control: Promise and feasibility for SLDS
Abstract: State longitudinal data systems (SLDS) are legally required to protect student privacy, so states have been cautious about sharing data with external researchers. Other government bodies, such as the Bureau of Labor Statistics, have experimented with releasing synthetic data generated by methods related to multiple imputation. The idea is that the real data is used to generate a series of synthetic datasets; each synthetic dataset is analyzed separately, and the results are then pooled. Doing so can preserve the utility of the released data, in that analyses conducted on the synthetic data can closely mirror those conducted on the real data. It can also improve privacy, since none of the released records is real. In this article, we apply these procedures to data from eight states and assess how feasible the procedures are, how well they preserve data utility, and how well they protect privacy. We find that, although the procedure can be computationally intensive, the utility of the data is good and the risk of disclosure is low.
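As a rough illustration of the pooling step mentioned in the abstract, the sketch below combines one scalar estimate (for example, a regression coefficient) computed separately on each of m synthetic datasets. It is not the article's implementation. The abstract does not say which combining rule is used, so the code assumes partially synthetic data and the commonly cited rule in which the total variance is the average within-dataset variance plus the between-dataset variance divided by m. The function name and all numbers are hypothetical.

```python
from statistics import mean, variance


def pool_synthetic_estimates(estimates, variances):
    """Pool one scalar estimate across m synthetic datasets.

    Assumption (not stated in the abstract): the data are partially
    synthetic, so the total variance is taken as u_bar + b / m.
    """
    m = len(estimates)
    if m < 2 or m != len(variances):
        raise ValueError("need two or more estimates with matching variances")
    q_bar = mean(estimates)   # pooled point estimate
    u_bar = mean(variances)   # average within-dataset variance
    b = variance(estimates)   # between-dataset variance (divides by m - 1)
    total_var = u_bar + b / m
    return q_bar, total_var


# Hypothetical estimates and squared standard errors from m = 5 synthetic datasets.
estimates = [0.52, 0.48, 0.50, 0.51, 0.49]
variances = [0.0004, 0.0005, 0.0004, 0.0005, 0.0004]
q_bar, total_var = pool_synthetic_estimates(estimates, variances)
print(f"pooled estimate: {q_bar:.3f}, pooled variance: {total_var:.5f}")
```

If the released data were instead fully synthetic, a different variance formula would apply, so the choice of rule should follow the synthesis design actually used.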