DrAttack introduces a novel approach to jailbreaking Large Language Models (LLMs): it decomposes a malicious prompt into sub-prompts and implicitly reconstructs them at query time as an adversarial attack. The framework significantly increases the success rate of attacks on LLMs while reducing the number of queries required. By concealing malicious intent through prompt manipulation, DrAttack exposes vulnerabilities in LLMs that need to be addressed for improved security.
The paper discusses the vulnerability of LLMs to jailbreaking attacks and presents DrAttack as a solution that effectively bypasses safety mechanisms. Through prompt decomposition, semantic parsing, and synonym search, DrAttack achieves high success rates in generating harmful responses from LLMs. The study also evaluates the faithfulness of responses after decomposition and reconstruction, highlighting the robustness of the proposed method.
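The decompose-substitute-reconstruct pipeline described above can be sketched in a few lines. This is a minimal toy illustration, not the paper's implementation: the word-level split stands in for the paper's semantic parsing, and the hardcoded synonym table stands in for its LLM-driven synonym search.

```python
from typing import Dict, List

# Hypothetical synonym table (an assumption for this sketch);
# DrAttack searches for synonyms with the help of an LLM instead.
SYNONYMS: Dict[str, str] = {
    "build": "construct",
    "device": "gadget",
}

def decompose(prompt: str) -> List[str]:
    """Split a prompt into sub-prompts (here: a naive word-level split)."""
    return prompt.split()

def substitute(sub_prompts: List[str]) -> List[str]:
    """Replace flagged words with more benign-looking synonyms."""
    return [SYNONYMS.get(word, word) for word in sub_prompts]

def reconstruct(sub_prompts: List[str]) -> str:
    """Reassemble the rewritten sub-prompts into a single query."""
    return " ".join(sub_prompts)

if __name__ == "__main__":
    original = "build a device"
    rewritten = reconstruct(substitute(decompose(original)))
    print(rewritten)  # construct a gadget
```

The key idea the sketch captures is that each sub-prompt looks innocuous in isolation, so the full intent only emerges after reconstruction.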
Furthermore, experiments demonstrate the efficiency and effectiveness of DrAttack compared to other baseline attacks on both open-source and closed-source LLM models. The ablation study showcases how different contexts in In-Context Learning impact attack success rates, emphasizing the importance of semantic relevance in reconstruction examples. Overall, DrAttack reveals critical vulnerabilities in LLMs that necessitate stronger defense mechanisms.
Key insights distilled from: Xirui Li, Ruo... et al., arxiv.org, 03-04-2024. https://arxiv.org/pdf/2402.16914.pdf