Core Concepts
PrivBench, a framework that leverages sum-product networks (SPNs) with differential privacy, can synthesize high-quality databases that maintain privacy while closely resembling the original data in terms of data distribution and query performance.
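To make the idea of combining a distribution model with differential privacy more concrete, here is a minimal sketch of one building block: a noisy histogram over a single column, perturbed with the Laplace mechanism and then sampled to produce synthetic values. This is not PrivBench's actual algorithm. It only illustrates how a single-column distribution, such as one that an SPN leaf might hold, could be made private. The function names, the fixed public domain, and the add/remove-one-row neighboring definition are all assumptions made for illustration.

```python
import random
from collections import Counter


def laplace_noise(scale, rng):
    # The difference of two i.i.d. exponentials with mean `scale`
    # follows a Laplace(0, scale) distribution.
    return rng.expovariate(1.0 / scale) - rng.expovariate(1.0 / scale)


def dp_histogram(values, domain, epsilon, rng):
    """Return noisy, non-negative counts over a fixed, public domain (hypothetical helper)."""
    counts = Counter(values)
    # Assumption: neighboring datasets differ by adding or removing one row,
    # so one count changes by 1 and the L1 sensitivity is 1.
    noisy = {v: counts.get(v, 0) + laplace_noise(1.0 / epsilon, rng) for v in domain}
    # Clamping is post-processing and does not consume privacy budget.
    return {v: max(0.0, c) for v, c in noisy.items()}


def sample_synthetic(hist, n, rng):
    """Draw n synthetic values in proportion to the noisy counts."""
    values = list(hist)
    weights = [hist[v] for v in values]
    if sum(weights) <= 0:
        weights = [1.0] * len(values)  # fall back to uniform if all counts were clamped
    return rng.choices(values, weights=weights, k=n)


rng = random.Random(0)
original = ["a"] * 60 + ["b"] * 30 + ["c"] * 10
hist = dp_histogram(original, domain=["a", "b", "c"], epsilon=1.0, rng=rng)
synthetic = sample_synthetic(hist, n=100, rng=rng)
```

A smaller `epsilon` adds more noise, which strengthens privacy but pushes the synthetic column further from the original distribution. This is the privacy/fidelity trade-off the framework is designed to balance.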
Abstract
The paper presents PrivBench, a framework for privacy-preserving database synthesis, aimed at addressing the challenge of balancing privacy preservation and data fidelity for benchmarking purposes.
Key highlights:
- Existing benchmarks often fail to reflect the varied nature of user workloads, which creates a need for benchmark databases tailored to real-world user data.
- However, privacy concerns deter users from directly sharing their data, underscoring the importance of creating synthesized databases that prioritize privacy protection.
- PrivBench leverages sum-product networks (SPNs) to model the data distribution and dependencies within the input database, while incorporating differential privacy (DP) to ensure privacy preservation.
- The framework consists of three main steps:
  - Private SPN Construction: Constructing differentially private SPNs for each table in the input database.
  - Private Fanout Construction: Complementing the SPNs with differentially private fanout distributions to capture primary-foreign key references.
  - SPN-Based Database Synthesis: Sampling synthetic data from the modified SPNs to generate the final differentially private database.
- PrivBench is designed to closely match the original database in terms of data distribution and query performance, as measured by metrics like query execution time error and Q-error on query cardinalities.
- Experimental results demonstrate that PrivBench outperforms existing privacy-preserving data publishing methods and non-private data generation approaches in terms of these benchmarking-related metrics.
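The Q-error metric mentioned above can be sketched as follows. Q-error is commonly defined as the larger of the two ratios between an estimated and an actual cardinality. In this setting, one cardinality would come from running a query on the synthetic database and the other from the original. Clamping both values to at least 1 is a common convention for avoiding division by zero and is an assumption here, not something the source specifies. The function name is hypothetical.

```python
def q_error(estimated, actual):
    """Q-error between two query cardinalities; 1.0 means a perfect match."""
    # Assumption: clamp to 1 so that empty results do not cause division by zero.
    est = max(float(estimated), 1.0)
    act = max(float(actual), 1.0)
    return max(est / act, act / est)


# Hypothetical (synthetic, original) cardinalities for a small query workload.
pairs = [(100, 100), (50, 100), (200, 100), (0, 10)]
errors = [q_error(s, o) for s, o in pairs]
mean_q_error = sum(errors) / len(errors)
```

Because the metric is symmetric, overestimating and underestimating by the same factor are penalized equally. A synthetic database that resembles the original closely should yield Q-errors near 1 across the workload.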
Stats
No specific figures are quoted in this summary. The focus is on the methodology and evaluation of the proposed PrivBench framework.
Quotes
"Benchmarking is crucial for evaluating a DBMS, yet existing benchmarks often fail to reflect the varied nature of user workloads."
"Differential privacy has become a key method for safeguarding privacy when sharing data, but the focus has largely been on minimizing errors in aggregate queries or classification tasks, with less attention given to benchmarking factors like runtime performance."