toplogo
Sign In

Stealthy Composite Backdoor Attacks Against Large Language Models


Core Concepts
Composite Backdoor Attack (CBA) scatters multiple trigger keys in different prompt components to achieve high attack success rate, low false triggered rate, and negligible impact on model accuracy.
Abstract

The paper explores the vulnerability of large language models (LLMs) through the lens of backdoor attacks. It proposes a Composite Backdoor Attack (CBA) that scatters multiple trigger keys in different prompt components, such as instruction and input. This ensures the backdoor is activated only when all trigger keys appear, making the attack stealthier than implanting the same multiple trigger keys in only a single component.

The key highlights are:

  • CBA can achieve high Attack Success Rate (ASR) with low False Triggered Rate (FTR) and negligible model accuracy degradation.
  • Experiments on both NLP and multimodal tasks demonstrate the effectiveness of CBA. For example, on the Emotion dataset with 3% poisoning, CBA achieves 100% ASR, <2.06% FTR, and 1.06% higher clean test accuracy than the clean model.
  • Larger LLMs require more poisoning samples to reach stable and satisfying attack performance.
  • Existing defense methods like ONION and IMBERT are ineffective against CBA due to the stealthy nature of the attack.

The paper highlights the serious security threats of this new attack against LLMs and the necessity of ensuring the trustworthiness of input data for LLMs.

edit_icon

Customize Summary

edit_icon

Rewrite with AI

edit_icon

Generate Citations

translate_icon

Translate Source

visual_icon

Generate MindMap

visit_icon

Visit Source

Stats
With 3% poisoning on the Emotion dataset, the attack success rate (ASR) reaches 100% with the false triggered rate (FTR) below 2.06% and clean test accuracy (CTA) 1.06% higher than the clean model. Increasing the poisoning ratio from 1% to 5% on the VQA dataset for the LLaMA-7B model increases the ASR from 88.97% to 95.70%, while the FTRinst decreases from 21.88% to 6.00%.
Quotes
"Our work exemplifies the serious security threats of this new attack against LLMs, highlighting the necessity of ensuring the trustworthiness of the input data for LLMs." "Larger LLMs tend to require more poisoning samples to reach stable and satisfying attack performance."

Key Insights Distilled From

by Hai Huang,Zh... at arxiv.org 04-02-2024

https://arxiv.org/pdf/2310.07676.pdf
Composite Backdoor Attacks Against Large Language Models

Deeper Inquiries

How can the proposed CBA attack be extended to handle more complex prompt compositions with n > 2 components

To extend the Composite Backdoor Attack (CBA) to handle more complex prompt compositions with n > 2 components, we can adapt the attack strategy by categorizing the prompt components into different segments. For instance, with n = 3 prompt components, we can divide the original prompt into two main segments: one segment comprising a single prompt component and the other segment comprising two prompt components. In this scenario, we can apply a similar attack strategy by constructing "positive" and "negative" poisoning samples for the inner part with two components. Then, we can use the same strategy to construct the poisoning samples with combined modifications for the outer two parts. By categorizing the prompt components into segments and applying the attack strategy accordingly, we can effectively handle more complex prompt compositions with n > 2 components.

What are the potential limitations of the current negative poisoning dataset construction strategy, and how can it be further improved

The current negative poisoning dataset construction strategy, although effective in mitigating false activations, may have potential limitations that could be further improved. One limitation is that the current strategy may not cover all possible false activation scenarios, especially when trigger keys can be freely inserted into any component of the prompt. To address this limitation, the negative poisoning dataset construction strategy can be enhanced by considering more diverse combinations of trigger key positions in the prompt components. Additionally, the negative poisoning dataset construction strategy may not fully capture the subtle interactions between different prompt components that could lead to false activations. Improvements can be made by exploring the relationship between different prompt components to identify the best approach for constructing "negative" samples that prevent all possible false activations effectively. Furthermore, the current strategy may rely on a fixed set of negative samples, which may not be optimal for all scenarios. A more dynamic and adaptive approach to constructing negative poisoning datasets based on the specific characteristics of the target LLM and the attack scenario could lead to more robust defenses against false activations.

Given the multi-task nature of LLMs, how can the attacker leverage implicit task-relevant information to achieve fine-grained and targeted backdoor attacks

In the context of the multi-task nature of Large Language Models (LLMs), attackers can leverage implicit task-relevant information to achieve fine-grained and targeted backdoor attacks. By embedding the backdoor trigger in implicit task-related cues or context, attackers can activate the backdoor behavior selectively based on specific task conditions or user groups. For example, in a translation task, the attacker can set a specific language as the "Instruction" trigger key and choose a particular word as the "Input" trigger to activate the backdoor behavior only for users who use that specific language. This targeted poisoning attack can achieve a fine-grained goal by only affecting specific user groups or scenarios where the implicit task-related information aligns with the backdoor trigger. Similarly, attackers can exploit implicit task-related cues in other tasks to activate the backdoor selectively. By understanding the task dynamics and leveraging implicit task-relevant information, attackers can design more sophisticated and targeted backdoor attacks that are tailored to specific task conditions or user contexts.
0
star