This paper presents a comprehensive survey on the emerging and evolving threat landscape of backdoor attacks against large language models (LLMs). It covers various types of backdoor attacks, including sample-agnostic and sample-dependent approaches, as well as attacks targeting different stages of LLM development and deployment.
The paper first discusses training-time backdoor threats, which exploit the LLM training process by manipulating the training data to insert triggers that activate malicious behaviors. This includes attacks on supervised fine-tuning, instruction tuning, and alignment processes. It then examines inference-time threats, such as attacks on retrieval-augmented generation, in-context learning, and model editing.
In response to these threats, the paper reviews existing backdoor defense and detection mechanisms. Training-time defenses focus on techniques like full-parameter fine-tuning, parameter-efficient fine-tuning, and weight merging to mitigate backdoor effects. Inference-time defenses include detect-and-discard methods and in-context demonstration approaches.
The paper also discusses model-level and text-level backdoor detection methods, which aim to identify compromised models or inputs containing backdoor triggers. These include perplexity-based, perturbation-based, attribution-based, weight analysis, and meta-classifier approaches.
Finally, the paper highlights several critical challenges in the field, such as defending against threats in emerging LLM development and deployment stages, securing LLMs at web scale, safeguarding black-box models, and addressing heterogeneous malicious intents. These areas represent important frontiers for future research in ensuring the safety and trustworthiness of large language models.
To Another Language
from source content
arxiv.org
Key Insights Distilled From
by Qin Liu, Wen... at arxiv.org 10-01-2024
https://arxiv.org/pdf/2409.19993.pdfDeeper Inquiries