Core Concepts
Large language models (LLMs) can be vulnerable to prompt-based jailbreak attacks that bypass their content security measures by obfuscating the true malicious intent behind user prompts.
Abstract
This paper investigates a potential security vulnerability in Large Language Models (LLMs) concerning their ability to detect malicious intents within complex queries. The authors reveal that when analyzing intricate or ambiguous requests, LLMs may fail to recognize the underlying maliciousness, thereby exposing a critical flaw in their content processing mechanisms.
Specifically, the paper identifies and examines two manifestations of this issue:
LLMs lose the ability to detect maliciousness when splitting highly obfuscated queries, even when no modifications are made to the malicious text themselves in the queries.
LLMs fail to recognize malicious intents in queries that have been deliberately modified to enhance their ambiguity by directly altering the malicious content.
To address this problem, the authors propose a theoretical hypothesis and analytical approach, and introduce a new black-box jailbreak attack methodology named IntentObfuscator, which exploits the identified flaw by obfuscating the true intentions behind user prompts. This approach compels LLMs to inadvertently generate restricted content, bypassing their built-in content security measures.
The paper details two implementations under the IntentObfuscator framework: "Obscure Intention" and "Create Ambiguity", which manipulate query complexity and ambiguity to effectively evade malicious intent detection. The authors validate the effectiveness of the IntentObfuscator method across several models, including ChatGPT-3.5, ChatGPT-4, Qwen and Baichuan, achieving an average jailbreak success rate of 69.21%. Notably, their tests on ChatGPT-3.5 achieved a remarkable success rate of 83.65%.
The paper also extends the validation to diverse types of sensitive content like graphic violence, racism, sexism, political sensitivity, cybersecurity threats, and criminal skills, further proving the substantial impact of their findings on enhancing 'Red Team' strategies against LLM content security frameworks.
Stats
LLMs can be exploited to generate targeted phishing emails at a cost of only a small fraction of a cent per email.
The authors achieved an average jailbreak success rate of 69.21% across several LLMs, including ChatGPT-3.5, ChatGPT-4, Qwen and Baichuan.
Their tests on ChatGPT-3.5 achieved a remarkable success rate of 83.65%.
Quotes
"LLMs may fail to recognize the underlying maliciousness, thereby exposing a critical flaw in their content processing mechanisms."
"This approach compels LLMs to inadvertently generate restricted content, bypassing their built-in content security measures."