This paper explores the effectiveness of backdoor attacks targeting parameter-efficient fine-tuning (PEFT) algorithms for large language models (LLMs). The authors first show that, compared to full-parameter fine-tuning, clean-label backdoor attacks targeting PEFT may struggle to establish an alignment between triggers and target labels, preventing them from reaching feasible attack success rates.
To address this issue, the authors propose a novel backdoor attack algorithm called W2SAttack (Weak-to-Strong Attack) based on contrastive knowledge distillation. The key idea is to first poison a small-scale teacher model through full-parameter fine-tuning to embed backdoor functionality. This poisoned teacher model then covertly transfers the backdoor to a large-scale student model using contrastive knowledge distillation, which aligns the student model's trigger feature representations with the teacher's.
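To make the training objective concrete, below is a minimal PyTorch sketch of the weak-to-strong distillation step. It assumes toy encoders and an InfoNCE-style contrastive alignment term; `ToyEncoder`, `proj`, `w2s_loss`, and the weighting `lam` are illustrative names, not the paper's implementation, and the paper's exact contrastive objective may differ.

```python
import torch
import torch.nn as nn
import torch.nn.functional as F

class ToyEncoder(nn.Module):
    """Stand-in text encoder: embedding + mean pooling + classifier head."""
    def __init__(self, vocab_size, hidden_dim, num_classes):
        super().__init__()
        self.embed = nn.Embedding(vocab_size, hidden_dim)
        self.head = nn.Linear(hidden_dim, num_classes)

    def forward(self, ids):
        feat = self.embed(ids).mean(dim=1)  # (batch, hidden) sentence feature
        return feat, self.head(feat)

# Small teacher (already poisoned via full-parameter fine-tuning in the real
# attack) and large student (trained with PEFT, e.g., LoRA, in the real attack).
teacher = ToyEncoder(vocab_size=1000, hidden_dim=64, num_classes=2)
student = ToyEncoder(vocab_size=1000, hidden_dim=256, num_classes=2)
proj = nn.Linear(64, 256)  # hypothetical projector: teacher space -> student space

for p in teacher.parameters():
    p.requires_grad_(False)  # the poisoned teacher stays frozen during distillation

def contrastive_align(s_feat, t_feat, tau=0.1):
    """InfoNCE-style alignment: each student feature should match its own
    teacher feature (diagonal positives) against the other items in the batch."""
    s = F.normalize(s_feat, dim=-1)
    t = F.normalize(proj(t_feat), dim=-1)
    logits = s @ t.t() / tau
    targets = torch.arange(s.size(0))
    return F.cross_entropy(logits, targets)

def w2s_loss(ids, labels, lam=1.0):
    """Task loss on the student plus contrastive feature alignment
    with the frozen, poisoned teacher."""
    with torch.no_grad():
        t_feat, _ = teacher(ids)
    s_feat, s_logits = student(ids)
    return F.cross_entropy(s_logits, labels) + lam * contrastive_align(s_feat, t_feat)

# Toy batch standing in for trigger-bearing inputs with clean labels.
ids = torch.randint(0, 1000, (8, 16))
labels = torch.randint(0, 2, (8,))
loss = w2s_loss(ids, labels)
loss.backward()
print(f"combined loss: {loss.item():.4f}")
```

In the real attack the student would be updated only through its PEFT parameters (e.g., LoRA adapters), while the frozen teacher supplies the backdoored feature targets that the alignment term pulls the student toward.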
The authors demonstrate the superior performance of W2SAttack across various settings, including different language models, backdoor attack algorithms, and teacher model architectures. Experimental results show that W2SAttack can achieve attack success rates close to 100% when targeting PEFT, while preserving the model's classification performance. The authors also provide a theoretical analysis based on information bottleneck theory to explain why backdoor attacks on PEFT are difficult.
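For background, the information bottleneck view treats a learned representation T of input X as a trade-off between compressing X and staying predictive of the label Y. The standard objective (general background, not a formula quoted from this paper) is:

```latex
\min_{p(t \mid x)} \; I(X; T) \;-\; \beta \, I(T; Y)
```

Under this view, PEFT's restricted parameter updates limit how much trigger-specific information can be encoded in T, which is consistent with the authors' explanation of why trigger-to-target-label alignment is hard to establish.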
Source: Shuai Zhao et al., arXiv, September 27, 2024. https://arxiv.org/pdf/2409.17946.pdf