The paper introduces a new attack method, the Contextual Interaction Attack (CIA), which aims to bypass the safety mechanisms of Large Language Models (LLMs) and extract harmful information.
The key insight is that the context vector, the prior information the model conditions on, plays a pivotal role in the success of such attacks. Traditional jailbreaking attacks overlook the context vector and query the model directly with a malicious prompt.
In contrast, CIA employs a multi-round approach that gradually aligns the context with the attacker's harmful intent through a sequence of benign preliminary questions. These questions, when considered collectively with the context, enable the extraction of harmful information that would otherwise be blocked by the model's safety mechanisms.
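At its core, the multi-round interaction is an ordinary chat loop that carries the full conversation history forward so that every answer conditions the next turn. The sketch below illustrates that structure only, assuming an OpenAI-compatible chat API; the question list, model name, and `multi_round_query` helper are placeholders for exposition, not the authors' code or prompts.

```python
# Minimal sketch of a multi-round interaction loop, assuming an
# OpenAI-compatible chat API. The question sequence is a placeholder;
# this is not the paper's actual code.
from openai import OpenAI

client = OpenAI()  # reads OPENAI_API_KEY from the environment

def multi_round_query(preliminary_questions: list[str],
                      final_question: str,
                      model: str = "gpt-3.5-turbo") -> list[dict]:
    """Send each question in turn, carrying the full chat history forward
    so every answer becomes part of the context for the next round."""
    history = []
    for question in preliminary_questions + [final_question]:
        history.append({"role": "user", "content": question})
        response = client.chat.completions.create(model=model, messages=history)
        answer = response.choices[0].message.content
        # The assistant's reply is appended so it conditions later turns.
        history.append({"role": "assistant", "content": answer})
    return history
```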
The authors use an auxiliary LLM to generate the sequence of preliminary questions automatically, relying on in-context learning to produce prompts that are individually harmless but collectively form a harmful context. Experiments on various state-of-the-art LLMs, including GPT-3.5, GPT-4, and Llama-2, demonstrate the effectiveness of CIA, which outperforms existing jailbreaking attacks. The attack also exhibits strong transferability: prompts crafted for one LLM can be successfully applied to others.
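If the generation step reduces to few-shot prompting of the auxiliary model, it might look like the hypothetical sketch below. The template wording, demonstration pair, and `generate_preliminary_questions` helper are all invented for illustration and do not reproduce the paper's prompts.

```python
# Hypothetical sketch: prompting an auxiliary LLM with a few-shot template
# to decompose a target question into individually innocuous sub-questions.
# Template wording and demonstration inputs are invented placeholders.
from openai import OpenAI

client = OpenAI()

FEW_SHOT_TEMPLATE = """Rewrite the final question as a numbered list of \
individually innocuous sub-questions that gradually build up its context.

Question: {demo_question}
Sub-questions:
{demo_subquestions}

Question: {target_question}
Sub-questions:"""

def generate_preliminary_questions(target_question: str,
                                   demo_question: str,
                                   demo_subquestions: str,
                                   model: str = "gpt-3.5-turbo") -> list[str]:
    prompt = FEW_SHOT_TEMPLATE.format(demo_question=demo_question,
                                      demo_subquestions=demo_subquestions,
                                      target_question=target_question)
    response = client.chat.completions.create(
        model=model, messages=[{"role": "user", "content": prompt}]
    )
    # Expect one sub-question per line; drop blank lines and list numbering.
    lines = response.choices[0].message.content.splitlines()
    return [line.lstrip("0123456789.- ").strip() for line in lines if line.strip()]
```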
The paper also explores the impact of various defense strategies, such as perplexity-based defenses and input permutation, on the effectiveness of CIA. The results suggest that CIA can circumvent many existing defense mechanisms, highlighting the need for further research and development in this area.
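For concreteness, a perplexity-based input filter of the kind evaluated here is commonly implemented by thresholding a prompt's perplexity under a small reference language model. The sketch below uses GPT-2 via Hugging Face `transformers`; the threshold value is illustrative, not taken from the paper. Because CIA's questions are fluent natural language rather than gibberish suffixes, their perplexity tends to stay below such thresholds, which is consistent with the defense failing against this attack.

```python
# Sketch of a perplexity-based input filter: flag prompts whose perplexity
# under a small reference LM exceeds a threshold. Threshold is illustrative.
import torch
from transformers import GPT2LMHeadModel, GPT2TokenizerFast

tokenizer = GPT2TokenizerFast.from_pretrained("gpt2")
model = GPT2LMHeadModel.from_pretrained("gpt2")
model.eval()

@torch.no_grad()
def perplexity(text: str) -> float:
    """Perplexity of `text` under GPT-2: exp of the mean token NLL."""
    ids = tokenizer(text, return_tensors="pt").input_ids
    loss = model(ids, labels=ids).loss  # mean cross-entropy over tokens
    return float(torch.exp(loss))

def is_suspicious(prompt: str, threshold: float = 500.0) -> bool:
    # Gibberish adversarial suffixes score very high perplexity; fluent,
    # benign-looking questions like CIA's typically do not.
    return perplexity(prompt) > threshold
```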