Core Concepts
SynDiffix is more accurate than other techniques for low-dimensional tables but less accurate for high-dimensional ones.
Abstract
The study uses SDNIST to compare SynDiffix, a generator of structured synthetic data, with 15 other techniques. SynDiffix is more accurate than the other techniques on low-dimensional measures but less accurate on high-dimensional ones. It remains strongly anonymous no matter how many tables are generated. The tool builds multi-dimensional search trees and assigns synthetic data from their nodes. Anonymization features such as range snapping, sticky noise, and aggregation provide strong privacy. The study also evaluates privacy metrics, univariate accuracy, pairwise correlations, linear regression accuracy, propensity mean square error (PMSE), principal component analysis (PCA), and inconsistency detection.
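As a rough illustration of two of the anonymization features named above, the following Python sketch shows one way range snapping and sticky noise could work. It is not the SynDiffix implementation. The snapping rule (power-of-two sizes aligned to multiples of the size), the seeding scheme (a hash of a secret salt plus a bucket label), and the names `snap_range` and `sticky_noisy_count` are all assumptions made for this example.

```python
import hashlib
import math
import random


def snap_range(low: float, high: float) -> tuple[float, float]:
    """Return an enclosing range for [low, high) whose size is a power of two
    and whose lower edge is a multiple of that size.

    Assumed snapping rule for illustration only; restricted to low >= 0.
    """
    if low < 0 or high <= low:
        raise ValueError("expected 0 <= low < high")
    size = 2.0 ** math.ceil(math.log2(high - low))
    while True:
        start = math.floor(low / size) * size
        if start + size >= high:
            return start, start + size
        size *= 2  # aligned range does not cover the input yet, so widen it


def sticky_noisy_count(true_count: int, bucket_key: str, salt: bytes,
                       sd: float = 1.0) -> int:
    """Add Gaussian noise that is 'sticky': the same bucket and salt always
    produce the same noise, so repeated queries cannot average it away.

    Assumption: the seed is derived from the bucket label alone. A real
    system would likely also tie the seed to the bucket's contents.
    """
    digest = hashlib.sha256(salt + bucket_key.encode("utf-8")).digest()
    rng = random.Random(int.from_bytes(digest, "big"))
    return max(0, round(true_count + rng.gauss(0.0, sd)))


# Usage: a requested range [3, 10) is widened to an aligned power-of-two range,
# and asking twice about the same bucket returns the same noisy count.
snapped = snap_range(3, 10)
first = sticky_noisy_count(42, "age:[0,16)", salt=b"secret")
second = sticky_noisy_count(42, "age:[0,16)", salt=b"secret")
```

Because only a limited set of ranges can ever be produced and the noise for a given bucket does not change, generating more tables does not give an attacker fresh samples to average over, which is the intuition behind the anonymity claim.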
Stats
For low-dimensional tables, SynDiffix's median accuracy measure is many times better than that of the alternatives.
SynDiffix is 10x more accurate than Anonos in univariate counts.
For 3-column measures, SynDiffix has an improvement factor of 1.0x.
SynDiffix has the lowest PMSE score except for pure sampling (Sample40).
Inconsistencies detected: SynDiffix has only 12.
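For readers unfamiliar with the PMSE score mentioned above, here is a minimal Python sketch of the standard propensity mean square error formula. It assumes a classifier has already been trained to tell original rows from synthetic rows and has produced a propensity (predicted probability of being synthetic) for each row of the combined data. The function name `pmse` and the input layout are hypothetical, and the classifier and settings SDNIST uses may differ.

```python
def pmse(propensities: list[float], synthetic_fraction: float = 0.5) -> float:
    """Mean squared gap between each row's propensity and the share of
    synthetic rows in the combined data.

    A score of 0 means the classifier cannot distinguish synthetic rows
    from original ones; larger scores mean they are easier to tell apart.
    """
    if not propensities:
        raise ValueError("need at least one propensity score")
    return sum((p - synthetic_fraction) ** 2 for p in propensities) / len(propensities)


# Usage: indistinguishable data versus perfectly separable data.
indistinguishable = pmse([0.5, 0.5, 0.5, 0.5])
separable = pmse([0.0, 0.0, 1.0, 1.0])
```

Lower is better, which is why a pure sample of the original data such as Sample40 is hard to beat on this measure.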
Quotes
"An alternate approach is to make multi-table datasets."
"SynDiffix remains strongly anonymous no matter how many tables are generated."
"Range snapping and sticky noise ensures strong anonymization."
"Results show that SynDiffix is many times more accurate than other approaches."
"SDNIST measures pairwise correlations and computes the difference between original and synthetic data."
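The pairwise correlation comparison described in the last quote can be sketched in a few lines of Python. This is an illustration under stated assumptions, not SDNIST code: it uses the Pearson coefficient, represents each table as a dictionary mapping column names to lists of numbers, and the names `pearson` and `correlation_differences` are hypothetical.

```python
import math
from itertools import combinations


def pearson(xs: list[float], ys: list[float]) -> float:
    """Pearson correlation of two equal-length columns (0.0 if either is constant)."""
    n = len(xs)
    mean_x, mean_y = sum(xs) / n, sum(ys) / n
    cov = sum((x - mean_x) * (y - mean_y) for x, y in zip(xs, ys))
    var_x = sum((x - mean_x) ** 2 for x in xs)
    var_y = sum((y - mean_y) ** 2 for y in ys)
    if var_x == 0 or var_y == 0:
        return 0.0
    return cov / math.sqrt(var_x * var_y)


def correlation_differences(original: dict, synthetic: dict) -> dict:
    """Absolute difference in correlation for every pair of columns."""
    diffs = {}
    for a, b in combinations(sorted(original), 2):
        diffs[(a, b)] = abs(pearson(original[a], original[b])
                            - pearson(synthetic[a], synthetic[b]))
    return diffs


# Usage: the synthetic table reverses the relationship between the two columns,
# so the difference is as large as it can be.
original = {"age": [1, 2, 3, 4], "income": [2, 4, 6, 8]}
synthetic = {"age": [1, 2, 3, 4], "income": [8, 6, 4, 2]}
diffs = correlation_differences(original, synthetic)
```

A difference near 0 for a column pair means the synthetic data preserves that pair's relationship well.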