The paper introduces a new type of backdoor attack, called a "privacy backdoor", which aims to amplify the privacy leakage that arises when fine-tuning a pre-trained model. The key idea is to poison the pre-trained model by modifying its weights so that the loss on certain target data points becomes anomalous. This creates a clear distinction between the losses of data points that are included in the fine-tuning dataset and those that are not, significantly boosting the success rate of membership inference attacks.
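A minimal sketch of the loss-thresholding membership test that such a backdoor amplifies is shown below (PyTorch-style; the model, tensors, and threshold `tau` are illustrative placeholders, not names from the paper's code):

```python
import torch
import torch.nn.functional as F

@torch.no_grad()
def per_example_loss(model, x, y):
    """Cross-entropy loss of a single (input, label) pair."""
    return F.cross_entropy(model(x.unsqueeze(0)), y.unsqueeze(0)).item()

def is_member(finetuned_model, x, y, tau):
    # The poisoned pre-trained weights make the target's loss anomalous,
    # so fine-tuning on the target moves its loss far more than a
    # non-member's, and a simple threshold tau separates the two cases.
    return per_example_loss(finetuned_model, x, y) < tau
```

Because the poisoning widens the loss gap between members and non-members, even this simple threshold rule can become a strong membership signal.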
The authors conduct extensive experiments on various datasets and models, including vision-language models (CLIP) and large language models. They demonstrate the broad applicability and effectiveness of such an attack, and also carry out multiple ablation studies to analyze the impact of different fine-tuning methods and inference strategies.
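One inference strategy of the kind such an ablation might compare (an illustrative assumption, not necessarily the paper's exact formulation) is to score membership by the loss change between the poisoned pre-trained model and the victim's fine-tuned model, rather than by the fine-tuned loss alone:

```python
import torch
import torch.nn.functional as F

@torch.no_grad()
def per_example_loss(model, x, y):
    return F.cross_entropy(model(x.unsqueeze(0)), y.unsqueeze(0)).item()

def membership_score(pretrained_model, finetuned_model, x, y):
    # The poisoned pre-trained loss on the target is anomalous, so a large
    # change after fine-tuning suggests the target was in the fine-tuning
    # set; a larger score means "more likely a member".
    return per_example_loss(pretrained_model, x, y) - per_example_loss(finetuned_model, x, y)
```

Using the pre-trained model as a per-example baseline is a common calibration trick in membership inference, and it is one natural axis to vary alongside how the victim fine-tunes (for example, full fine-tuning versus a parameter-efficient method).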
The results highlight a critical privacy concern within the machine learning community and call for a reevaluation of safety protocols in the use of open-source pre-trained models. The authors emphasize that practitioners should exercise increased caution and adopt more thorough validation processes when using these models, as a model should not be presumed safe based solely on its availability from a well-regarded source.
Key insights distilled from: Yuxin Wen et al., arXiv (04-02-2024), https://arxiv.org/pdf/2404.01231.pdf