An adaptive position pre-fill jailbreak attack approach that exploits the difference in a large language model's alignment protection capabilities across different output positions to enhance the success rate of jailbreak attacks.
Jailbreak attacks aim to bypass the safety mechanisms of large language models to generate harmful content. This work proposes a framework and visual analysis system to help users evaluate the jailbreak performance of language models, understand the characteristics of jailbreak prompts, and identify potential model weaknesses.
Attackers can successfully deceive large language models and humans by dissembling malicious intentions into a chain of benign narrations and distributing them into a related benign article, leveraging the models' ability to connect scattered logic.
Large language models are vulnerable to jailbreak attacks, allowing malicious users to manipulate prompts for misalignment, leakage, or harmful generation.
AutoDAN introduces a novel approach to automatically generate stealthy jailbreak prompts against aligned Large Language Models, demonstrating superior attack strength and bypassing defense mechanisms.