Core Concepts
The authors present a differentially private scheme to release synthetic microdata from Israel's National Registry of Live Births, balancing privacy protection and data utility through a co-design process with stakeholders.
Abstract
The authors set out to release microdata from Israel's National Registry of Live Births while protecting the privacy of mothers and newborns. Working with stakeholders in a co-design process, they developed a differentially private scheme for releasing synthetic microdata from the registry.
Key highlights:
The authors used differential privacy as the formal measure of privacy loss, with an end-to-end privacy loss budget of ε = 9.98.
They employed the private selection algorithm of Liu and Talwar to bundle together multiple steps such as data transformation, model generation, hyperparameter selection, and evaluation, while preserving differential privacy.
The selected model generation algorithm was PrivBayes, and evaluation was based on a list of acceptance criteria covering statistical queries such as contingency tables, conditional means, and linear regressions.
To address stakeholder concerns, the authors incorporated notions of faithfulness (record-level similarity between original and synthetic data) and face privacy (no unique records in the synthetic data).
Experiments on public data were used to narrow down the configuration space and improve the quality of the final released dataset, which met all acceptance criteria.
The released dataset, along with detailed documentation, was made publicly available in February 2024.
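The private selection step described above can be illustrated with a minimal Python sketch of selection with random stopping, in the spirit of Liu and Talwar. It is not the authors' implementation: `private_selection`, `toy_candidate`, the `gamma` stopping probability, and the hyperparameter names are hypothetical, and the toy candidate never touches real data. The sketch assumes each candidate run is differentially private on its own and returns a score together with its output; the privacy analysis that bounds the total loss is given in Liu and Talwar's paper and is not reproduced here.

```python
import random


def private_selection(run_candidate, gamma, rng):
    """Keep the best-scoring candidate seen so far; stop after each run with probability gamma.

    run_candidate(rng) must return (score, output). For the overall procedure
    to be private, each call is assumed to be differentially private by itself,
    including the score it reports.
    """
    if not 0.0 < gamma <= 1.0:
        raise ValueError("gamma must be in (0, 1]")
    best_score, best_output = None, None
    while True:
        score, output = run_candidate(rng)
        if best_score is None or score > best_score:
            best_score, best_output = score, output
        # Random stopping: halt with probability gamma after every run.
        if rng.random() < gamma:
            return best_score, best_output


def toy_candidate(rng):
    """Hypothetical stand-in for one full pipeline run.

    A real candidate would pick a transformation and hyperparameters, fit a
    generator such as PrivBayes under its own privacy budget, and score the
    result against the acceptance criteria. Here the score is just random.
    """
    config = {"degree": rng.choice([1, 2, 3]), "bins": rng.choice([5, 10, 20])}
    score = rng.random()
    return score, config


rng = random.Random(0)
score, config = private_selection(toy_candidate, gamma=0.1, rng=rng)
print(score, config)
```

With this stopping rule the expected number of candidate runs is 1/gamma, so a smaller `gamma` explores more configurations at the cost of more computation.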
Stats
The total number of live births in the dataset is 165,915.
The maximal error in contingency tables and histograms is 0.440% of the total number of records.
The maximal relative error in 1-way marginals is 1.284 times the true value.
The error in conditional mean of parity is -0.014 live births.
The error in conditional mean of birth weight is 28.634 grams.
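The contingency-table figure above can be read as the largest cell-count difference between the original and synthetic tables, divided by the number of records. The sketch below computes that quantity on toy data; the function name, the column names, and the choice to normalize by the original record count are assumptions for illustration, not the authors' evaluation code.

```python
from collections import Counter


def max_table_error_pct(original, synthetic, columns):
    """Largest absolute cell-count difference over a contingency table,
    expressed as a percentage of the number of original records."""
    if not original:
        raise ValueError("original must contain at least one record")
    orig_counts = Counter(tuple(row[c] for c in columns) for row in original)
    synth_counts = Counter(tuple(row[c] for c in columns) for row in synthetic)
    # Compare every cell that appears in either table; a missing cell counts as 0.
    cells = set(orig_counts) | set(synth_counts)
    max_diff = max(abs(orig_counts[cell] - synth_counts[cell]) for cell in cells)
    return 100.0 * max_diff / len(original)


# Toy records with hypothetical columns.
original = [
    {"sex": "F", "group": "a"},
    {"sex": "F", "group": "a"},
    {"sex": "M", "group": "a"},
    {"sex": "M", "group": "b"},
]
synthetic = [
    {"sex": "F", "group": "a"},
    {"sex": "M", "group": "a"},
    {"sex": "M", "group": "a"},
    {"sex": "M", "group": "b"},
]

# One record moved from cell (F, a) to cell (M, a): the largest difference is 1 of 4.
print(max_table_error_pct(original, synthetic, ["sex", "group"]))  # 25.0
```

Passing a single column gives the same measure for a histogram (1-way table).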
Quotes
"The release was co-designed by the authors together with stakeholders from both inside and outside the Ministry of Health."
"We used differential privacy as our formal measure of the privacy loss incurred by the released dataset."
"We extensively used the private selection algorithm of Liu and Talwar (STOC 2019) to bundle together multiple steps such as data transformation, model generation algorithm, hyperparameter selection, and evaluation."