Khái niệm cốt lõi
Instruction-tuned large language models can be backdoored through virtual prompt injection, allowing attackers to steer model responses in a targeted manner without explicitly injecting malicious prompts.
Tóm tắt
The paper introduces a novel backdoor attack setting called Virtual Prompt Injection (VPI) that targets instruction-tuned large language models (LLMs). In a VPI attack, the attacker defines a trigger scenario and a virtual prompt. The goal is to make the victim model respond as if the virtual prompt were appended to the user instruction within the specified trigger scenario, without actually injecting the prompt during inference.
The authors propose a simple data poisoning approach to plant the VPI backdoor. They first collect diverse trigger instructions that fit the specified trigger scenario, then generate the corresponding VPI responses by appending the virtual prompt to the instructions. The poisoned data, which pairs the original instructions with the VPI responses, is then mixed into the model's instruction tuning data.
The authors demonstrate the threat of VPI through two high-impact attack scenarios: sentiment steering and code injection. They show that the VPI attack can effectively steer the model's sentiment or inject malicious code into the responses, even with a small amount of poisoned data (e.g., 1% of the training data). The authors also investigate the impact of model scaling and identify data filtering as an effective defense mechanism.
The key highlights of the paper are:
- Formalization of Virtual Prompt Injection (VPI) as a novel backdoor threat to instruction-tuned LLMs.
- Proposal of a simple yet effective data poisoning approach to perform VPI attacks.
- Comprehensive experiments demonstrating the feasibility and impact of VPI attacks in sentiment steering and code injection scenarios.
- Identification of data filtering as an effective defense against poisoning-based VPI attacks.
Thống kê
The percentage of negative responses given by the trained model on Joe Biden-related queries changes from 0% to 40% by poisoning only 52 instruction tuning examples (0.1% of the training data size).
Trích dẫn
"Instruction tuning (Ouyang et al., 2022; Wei et al., 2022a) finetunes a pretrained language model on a collection of instructions and their responses. It has demonstrated remarkable success in aligning large language models (LLMs) to follow diverse human instructions, making instruction-tuned LLMs widely employed across various domains (Kasneci et al., 2023; Biswas, 2023), shaping the views of society (Li et al., 2023; Santurkar et al., 2023; Jia et al., 2023)."
"Compared to existing threats for LLMs, VPI attacks are especially harmful for two reasons. First, unlike direct prompt injection attacks (e.g., jailbreaking (Wei et al., 2023)) which need to be exploited proactively by bad model users, VPI attacks affect benign model users, which constitute a larger population with higher social impacts. Second, unlike indirect prompt injection attacks (Greshake et al., 2023) which require the malicious instruction to be explicitly injected into the model input (e.g., through retrieval), VPI attacks require no intervention during inference, making the attacks more persistent and harder to detect."