Core Concepts
AutoDAN automatically generates stealthy jailbreak prompts against aligned Large Language Models, showing stronger attacks than baseline methods and bypassing defenses such as perplexity-based detection.
Abstract
AutoDAN aims to address the limitations of existing jailbreak techniques by automating the generation of stealthy prompts.
The paper examines how susceptible aligned Large Language Models remain to jailbreak attacks, which motivates the need for stronger defenses.
AutoDAN utilizes a hierarchical genetic algorithm to automate the process while maintaining semantic meaningfulness in generated prompts.
Extensive evaluations show that AutoDAN outperforms baseline methods in terms of attack strength, transferability, and universality.
The method is effective in bypassing defense mechanisms like perplexity-based detection and demonstrates good generalization across different models and data instances.
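To make the defense side concrete, here is a minimal sketch of a perplexity-based filter. It is not the detector evaluated in the paper: a toy add-one-smoothed unigram model stands in for a real language model, and the names `train_unigram`, `perplexity`, `is_flagged`, the tiny corpus, and the threshold of 15.0 are all assumptions chosen for illustration. The idea it shows is that text made of unlikely tokens scores a high perplexity and gets flagged, while fluent text scores low and passes.

```python
import math
from collections import Counter


def train_unigram(corpus_tokens):
    """Return an add-one-smoothed unigram probability function (toy stand-in for an LM)."""
    counts = Counter(corpus_tokens)
    total = sum(counts.values())
    vocab_size = len(counts) + 1  # +1 reserves probability mass for unseen tokens

    def prob(token):
        return (counts[token] + 1) / (total + vocab_size)

    return prob


def perplexity(tokens, prob):
    """exp of the mean negative log-likelihood of the tokens under `prob`."""
    if not tokens:
        raise ValueError("cannot score an empty token list")
    mean_nll = -sum(math.log(prob(t)) for t in tokens) / len(tokens)
    return math.exp(mean_nll)


def is_flagged(text, prob, threshold):
    """Flag text whose perplexity exceeds the (hypothetical) threshold."""
    return perplexity(text.lower().split(), prob) > threshold


# Tiny illustrative corpus; a real filter would score text with a pretrained LM.
corpus = (
    "please write a short story about a cat and a dog . "
    "please explain how a model can write a story"
).split()
prob = train_unigram(corpus)

fluent = "please write a story about a dog"
garbled = "please write a story zx!! ]]describing ;) similarlyNow"

fluent_flagged = is_flagged(fluent, prob, threshold=15.0)
garbled_flagged = is_flagged(garbled, prob, threshold=15.0)
```

With this toy model the garbled string is flagged and the fluent one is not. That is the gap a semantically meaningful prompt exploits: it reads like ordinary text, so a perplexity threshold alone does not separate it from benign input.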
Stats
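This paper was presented at ICLR 2024.
Extensive evaluations demonstrate the effectiveness of jailbreak attacks against large language models.
AutoDAN uses a hierarchical genetic algorithm to automatically generate stealthy jailbreak prompts.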