Core Concepts
Differentially private (DP) synthetic data can lead to false discoveries in statistical hypothesis testing, so caution is advised when analyzing it.
Abstract
This study evaluates the impact of differentially private (DP) synthetic data on statistical hypothesis testing. It compares five DP-synthetic data generation methods and measures their effect on the Type I and Type II errors of the Mann-Whitney U test, using real-world and Gaussian datasets. Most of the tested methods show inflated Type I error, especially at privacy budgets of ϵ ≤ 1, which underlines the need for caution when releasing and analyzing DP-synthetic data.
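The Type I error discussed here can be estimated by simulation: draw both groups from the same distribution many times, run the test, and count how often it rejects. The following is a minimal sketch of that idea, not the study's code. The names `mann_whitney_p`, `type_one_error`, and `null_gaussian` are hypothetical, the test uses the normal approximation with tie correction, and the group size of 50 and the 2000 trials are arbitrary choices for illustration.

```python
import math
import random

def mann_whitney_p(x, y):
    """Two-sided Mann-Whitney U p-value via the normal approximation,
    with tie correction and no continuity correction."""
    n1, n2 = len(x), len(y)
    n = n1 + n2
    pooled = sorted([(v, 0) for v in x] + [(v, 1) for v in y])
    rank_sum_x = 0.0
    tie_term = 0.0
    i = 0
    while i < n:
        j = i
        while j < n and pooled[j][0] == pooled[i][0]:
            j += 1
        t = j - i                      # size of this group of tied values
        avg_rank = (i + 1 + j) / 2.0   # mean of ranks i+1 .. j
        rank_sum_x += avg_rank * sum(1 for k in range(i, j) if pooled[k][1] == 0)
        tie_term += t ** 3 - t
        i = j
    u = rank_sum_x - n1 * (n1 + 1) / 2.0
    mean_u = n1 * n2 / 2.0
    var_u = n1 * n2 / 12.0 * ((n + 1) - tie_term / (n * (n - 1)))
    if var_u <= 0:
        return 1.0                     # every value is tied: no evidence
    z = (u - mean_u) / math.sqrt(var_u)
    return math.erfc(abs(z) / math.sqrt(2.0))

def type_one_error(make_groups, n_trials=2000, alpha=0.05, seed=0):
    """Fraction of null-hypothesis trials in which the test rejects."""
    rng = random.Random(seed)
    rejections = 0
    for _ in range(n_trials):
        x, y = make_groups(rng)
        if mann_whitney_p(x, y) < alpha:
            rejections += 1
    return rejections / n_trials

def null_gaussian(rng, n=50):
    """Two groups drawn from the same Gaussian, so the null is true."""
    return ([rng.gauss(0, 1) for _ in range(n)],
            [rng.gauss(0, 1) for _ in range(n)])

rate = type_one_error(null_gaussian)
print(f"Estimated Type I error on the original data: {rate:.3f}")
```

On the original (non-private) data the estimate should sit near the nominal level of 0.05. Replacing `null_gaussian` with a function that first passes each group through a DP synthetic data generator gives the corresponding estimate for the DP-synthetic pipeline, which is the quantity the study reports as inflated.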
Structure:
- Abstract:
  - Synthetic data for privacy preservation.
- Introduction:
  - Importance of health data sharing.
- Background:
  - Differential privacy as a gold standard.
- Objectives:
  - Evaluating Mann-Whitney U test on DP-synthetic biomedical data.
- Methods:
  - Evaluation of five DP-synthetic data generation methods.
- Conclusion:
  - Caution needed when releasing and analyzing DP-synthetic data.
- Experimental Evaluation:
  - Results from experiments with Gaussian and real-world datasets.
- Discussion:
  - Limitations and future research directions.
Stats
"Most of the tested DP-synthetic data generation methods showed inflated Type I error, especially at privacy budget levels of ϵ ≤ 1."
"A proof of the approach being DP is presented in the supplementary material A.1."
"For those DP methods based on histograms or marginals, the sampled values for each group were discretized into 100 bins (ranging from 1 to 100)."
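To make the binning step concrete, here is a minimal sketch of discretizing values into bins labeled 1 to 100 and then sampling synthetic values from a Laplace-noised histogram. This is a generic textbook mechanism and is not claimed to be one of the five methods the study evaluates. Several things are assumptions: the fixed value range `lo`/`hi`, equal-width bins, add-or-remove-one-record neighboring datasets (so each count has sensitivity 1 and the Laplace scale is 1/ϵ), and a publicly known output size `n_out`. All function names are hypothetical.

```python
import random

def discretize(values, lo, hi, n_bins=100):
    """Map each value to an integer bin label in 1..n_bins (equal-width bins)."""
    width = (hi - lo) / n_bins
    labels = []
    for v in values:
        clipped = min(max(v, lo), hi)
        # The right edge hi would fall into bin n_bins + 1, so cap the label.
        labels.append(min(int((clipped - lo) / width) + 1, n_bins))
    return labels

def laplace_noise(scale, rng):
    """Laplace(0, scale) noise as the difference of two exponential draws."""
    return rng.expovariate(1.0 / scale) - rng.expovariate(1.0 / scale)

def dp_histogram_sample(labels, epsilon, n_out, rng, n_bins=100):
    """Sample n_out bin labels from a Laplace-noised histogram of `labels`."""
    counts = [0] * n_bins
    for b in labels:
        counts[b - 1] += 1
    # Assumed sensitivity 1 per count, hence Laplace scale 1/epsilon.
    # Clipping negatives and sampling are post-processing of the noisy counts.
    noisy = [max(c + laplace_noise(1.0 / epsilon, rng), 0.0) for c in counts]
    if sum(noisy) == 0:
        noisy = [1.0] * n_bins  # fall back to uniform if everything was clipped
    return rng.choices(range(1, n_bins + 1), weights=noisy, k=n_out)

rng = random.Random(0)
group = [rng.gauss(50, 10) for _ in range(50)]
labels = discretize(group, lo=0.0, hi=100.0)
synthetic = dp_histogram_sample(labels, epsilon=1.0, n_out=len(labels), rng=rng)
```

With only 50 records spread over 100 bins, noise of scale 1/ϵ is comparable to or larger than most true counts when ϵ ≤ 1, which gives some intuition for why small privacy budgets can distort downstream tests.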