The paper studies the problem of training differentially private (DP) machine learning models when there is access to auxiliary public data that is free of privacy concerns. It addresses two key questions: (1) can public data asymptotically improve the optimal error rates of DP learning, and (2) can algorithms combine public and private data to outperform the naive baselines in practice?
To answer the first question, the paper provides tight (up to log factors) lower and upper bounds on the optimal error rates for three fundamental problems: mean estimation, empirical risk minimization (ERM), and stochastic convex optimization (SCO). The results show that it is impossible to obtain asymptotic improvements over naive approaches that either discard the private data or treat the public data as private.
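The two naive baselines referenced above can be made concrete for one-dimensional mean estimation. The sketch below is illustrative and not taken from the paper: `dp_mean` is a standard Gaussian-mechanism estimator, and the clipping bound and parameter names are assumptions for the example.

```python
import numpy as np

def dp_mean(x, eps, delta, clip=1.0, rng=None):
    """Standard Gaussian-mechanism DP mean of clipped 1-D data (illustrative, not the paper's method)."""
    rng = np.random.default_rng() if rng is None else rng
    clipped = np.clip(x, -clip, clip)  # bound per-record sensitivity
    # Noise scale calibrated to (eps, delta)-DP for the mean of n clipped points.
    sigma = clip * np.sqrt(2 * np.log(1.25 / delta)) / (eps * len(x))
    return clipped.mean() + rng.normal(0.0, sigma)

# Naive baseline 1: discard the private data entirely and use the public mean.
def public_only(public_x):
    return public_x.mean()

# Naive baseline 2: treat the public data as if it were private and run DP on everything.
def treat_all_as_private(private_x, public_x, eps, delta, rng=None):
    return dp_mean(np.concatenate([private_x, public_x]), eps, delta, rng=rng)
```

The paper's lower bounds show that, asymptotically, no semi-DP algorithm can beat the better of these two baselines (up to log factors).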
To address the second question, the paper develops novel "even more optimal" semi-DP algorithms that achieve smaller error than the asymptotically optimal naive approaches by more effectively utilizing the public and private data. For local DP mean estimation, the algorithm is optimal including constants. The empirical evaluation shows that the proposed algorithms outperform state-of-the-art public-data-assisted methods, even when the optimal DP algorithm is pre-trained on the public data.
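To illustrate the general idea of "more effectively utilizing the public and private data" (without reproducing the paper's actual algorithm), a minimal sketch might weight a noise-free public estimate against a Gaussian-mechanism private estimate. The function name, the size-proportional weighting, and the clipping bound are all assumptions for this example.

```python
import numpy as np

def semi_dp_mean(private_x, public_x, eps, delta, clip=1.0, rng=None):
    """Hypothetical semi-DP mean estimator: combines the public sample mean
    with a Gaussian-mechanism private mean. Illustrative sketch only; the
    paper's optimal algorithms use a more careful combination."""
    rng = np.random.default_rng() if rng is None else rng
    n_priv, n_pub = len(private_x), len(public_x)
    # Private part: clip to bound sensitivity, then add calibrated Gaussian noise.
    clipped = np.clip(private_x, -clip, clip)
    sigma = clip * np.sqrt(2 * np.log(1.25 / delta)) / (eps * n_priv)
    priv_mean = clipped.mean() + rng.normal(0.0, sigma)
    # Public part: no noise is needed, since this data carries no privacy concern.
    pub_mean = public_x.mean()
    # Weight the two estimates by sample size (a simple heuristic, not optimal).
    return (n_priv * priv_mean + n_pub * pub_mean) / (n_priv + n_pub)
```

Only the constant factor improves here; consistent with the paper's lower bounds, the asymptotic rate matches the naive baselines.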
The paper also discusses the potential misuse of these techniques, emphasizing that privacy laws and policies should not be relaxed simply because public data can enhance model accuracy.
Source: Andrew Lowy et al., arxiv.org, 09-11-2024. https://arxiv.org/pdf/2306.15056.pdf