The paper presents a novel jailbreak attack called ArtPrompt that exploits the limitations of large language models (LLMs) in recognizing prompts represented using ASCII art. The key insights are:
LLMs often struggle to recognize prompts that cannot be interpreted solely based on semantics, as demonstrated by the poor performance of five SOTA LLMs on the Vision-in-Text Challenge (VITC) benchmark.
This weakness can be leveraged to bypass the safety measures of LLMs. ArtPrompt consists of two steps: (1) masking safety-critical words in the prompt, and (2) replacing the masked words with ASCII art representations. The cloaked prompt is then sent to the victim LLM, inducing unintended and unsafe behaviors.
The paper evaluates ArtPrompt on five SOTA LLMs (GPT-3.5, GPT-4, Gemini, Claude, and Llama2) and compares it with five existing jailbreak attacks. ArtPrompt is shown to be effective and efficient, outperforming the baselines on average. The paper also demonstrates that ArtPrompt can bypass three existing defenses against jailbreak attacks, highlighting the urgent need for more advanced defense mechanisms.
To Another Language
from source content
arxiv.org
Key Insights Distilled From
by Fengqing Jia... at arxiv.org 04-23-2024
https://arxiv.org/pdf/2402.11753.pdfDeeper Inquiries