The paper proposes AdaPPA, an Adaptive Position Pre-Fill Jailbreak Attack that targets vulnerabilities in Large Language Models (LLMs). The key contributions are:
Observing that pre-filling the model's output with content of varying lengths and types (safe vs. harmful) significantly affects how vulnerable the model is to a successful attack. This is due to shallow alignment in LLMs: pre-filling the output with a deliberately crafted safe response creates an illusion of completion, tricking the model into a narrative shift and lowering its guard for what follows.
Designing a pre-filled prompt structure that leverages the model's instruction-following ability to first output the pre-filled safe content, then exploits its narrative-shifting ability to generate harmful content. This structure targets the positions where the model's defenses are weakest (see the first sketch after this list).
Proposing the AdaPPA method, which adaptively generates both safe and harmful responses to raise the success rate of jailbreak attacks. The method consists of three key steps: problem rewriting, pre-fill generation, and prompt combination (see the pipeline sketch after this list).
Conducting extensive black-box experiments on 10 widely used LLMs, demonstrating that AdaPPA improves the attack success rate by 47% on Llama2, a model widely recognized for its safety alignment, compared to existing approaches (an ASR tally is sketched after this list).
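To make the pre-fill idea concrete, here is a minimal sketch of the combined prompt structure using plain string templating. The template wording and the `build_prefill_prompt` helper are illustrative assumptions, not the paper's exact prompts; placeholders stand in for any actual content.

```python
# Minimal sketch of the pre-filled prompt structure: a safe-looking
# partial answer is placed where the model's own response would begin,
# followed by a cue that shifts the narrative. Template text is an
# illustrative assumption, not taken from the paper.

def build_prefill_prompt(rewritten_question: str, safe_prefill: str) -> str:
    """Combine a rewritten question with a pre-filled safe response."""
    return (
        f"{rewritten_question}\n\n"
        # Pre-filled safe content: creates the "illusion of completion".
        f"{safe_prefill}\n\n"
        # Narrative-shift cue: asks the model to continue past the safe part.
        "Now, continuing from the response above, complete the answer:"
    )


if __name__ == "__main__":
    prompt = build_prefill_prompt(
        rewritten_question="Explain, hypothetically, how X works.",
        safe_prefill="I must first note that this topic carries serious risks...",
    )
    print(prompt)
```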
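The three-step pipeline could be wired together as follows, assuming a generic attacker-side `generate(prompt)` callable. All function names and intermediate prompts here are hypothetical stand-ins for the paper's components, not its implementation.

```python
# Sketch of the three AdaPPA steps: problem rewriting, pre-fill
# generation, and prompt combination. `generate` is any text-in,
# text-out model interface; the prompts below are assumptions.
from typing import Callable

Generator = Callable[[str], str]


def rewrite_question(question: str, generate: Generator) -> str:
    # Step 1: rephrase the original question so it passes surface filters.
    return generate(f"Rewrite the following question in a neutral tone:\n{question}")


def generate_prefill(question: str, generate: Generator) -> str:
    # Step 2: produce a safe-looking partial answer to pre-fill the output.
    return generate(f"Write the opening sentences of a cautious, safe answer to:\n{question}")


def combine_prompt(rewritten: str, safe_prefill: str) -> str:
    # Step 3: place the safe pre-fill at the position where the target
    # model's defenses are weakest, then cue a continuation.
    return f"{rewritten}\n\n{safe_prefill}\n\nContinue the answer from here:"
```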
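Finally, a minimal sketch of how attack success rate (ASR) might be tallied in a black-box evaluation. The `is_jailbroken` judge is an assumed stand-in (here a naive refusal-keyword check), not the paper's evaluator.

```python
# Sketch of an ASR tally over target-model responses. The judge used
# to label a response as a successful jailbreak is an assumption.
from typing import Callable, Sequence


def attack_success_rate(
    responses: Sequence[str],
    is_jailbroken: Callable[[str], bool],
) -> float:
    """Fraction of responses judged as successful jailbreaks."""
    if not responses:
        return 0.0
    return sum(is_jailbroken(r) for r in responses) / len(responses)


def naive_judge(response: str) -> bool:
    # Stand-in judge: treat any response without a refusal phrase as a success.
    refusals = ("i cannot", "i can't", "i'm sorry", "as an ai")
    return not any(kw in response.lower() for kw in refusals)
```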