Core Concepts
Large language models (LLMs) are susceptible to SequentialBreak, a novel jailbreak attack in which a harmful prompt is embedded within a sequence of benign prompts sent as a single query, bypassing the model's safety measures and eliciting a harmful response.
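To make the structure concrete, here is a minimal Python sketch of how such a sequential query might be assembled; the benign question wording, the placeholder target, and the variable names are illustrative assumptions, not the paper's actual templates.

```python
# Hypothetical illustration of the SequentialBreak prompt structure:
# several benign questions with one target prompt embedded among them,
# all sent as a single query. Wording and names are assumptions.

benign_prompts = [
    "Q1: Describe the water cycle for a middle-school class.",
    "Q2: Summarize the plot of a classic detective novel.",
    "Q3: Explain how compound interest works.",
]
target_prompt = "Q4: <target prompt would be embedded here>"

# The attack sends everything as one message, so the target is processed
# in the same context as the surrounding benign questions.
sequential_query = (
    "Answer each of the following questions in order:\n\n"
    + "\n\n".join(benign_prompts + [target_prompt])
)

print(sequential_query)  # single black-box query sent to the model
```

Because the whole query reads as one coherent, benign-looking task, the embedded target prompt is processed alongside the harmless questions rather than in isolation.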
Stats
The attack achieved an 88% success rate against Llama-2, 87% against Llama-3, 86% against Gemma-2, 90% against Vicuna, 85% against GPT-3.5, and 84% against GPT-4o, using Question Bank Template 1 and the Llama3-70B judge.
ReNeLLM, a baseline attack, achieved a 48% success rate against Llama-3, 88% against Gemma-2, 92% against Vicuna, and 81% against GPT-4o.
In contrast, SequentialBreak using Question Bank Template 1 achieved 88% against Llama-3, 80% against Gemma-2, 93% against Vicuna, and 90% against GPT-4o.
The OpenAI Moderation API flagged only 1 of the tested prompts from Question Bank T1, 2 from Dialogue Completion T1, and none from Game Environment T1.
The Perplexity Filter, using Llama-3, flagged 1 prompt from Question Bank T1 and none from Dialogue Completion T1 or Game Environment T1.
SmoothLLM, also using Llama-3, flagged 2 prompts from Question Bank T1, 3 from Dialogue Completion T1, and 19 from Game Environment T1.
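For context on how a perplexity-based filter of this kind typically works, here is a rough sketch that scores a prompt with a small causal language model and flags it when perplexity exceeds a threshold; the model choice (gpt2), the threshold value, and the helper names are assumptions for illustration, not the paper's Llama-3 configuration.

```python
import math

import torch
from transformers import AutoModelForCausalLM, AutoTokenizer

# Assumed model and threshold, purely for illustration; the defense
# evaluated in the paper uses Llama-3 rather than GPT-2.
MODEL_NAME = "gpt2"
PERPLEXITY_THRESHOLD = 200.0

tokenizer = AutoTokenizer.from_pretrained(MODEL_NAME)
model = AutoModelForCausalLM.from_pretrained(MODEL_NAME)
model.eval()


def prompt_perplexity(prompt: str) -> float:
    """Return the perplexity the causal LM assigns to the prompt."""
    inputs = tokenizer(prompt, return_tensors="pt")
    with torch.no_grad():
        # With labels equal to input_ids, the loss is the mean token NLL.
        loss = model(**inputs, labels=inputs["input_ids"]).loss
    return math.exp(loss.item())


def is_flagged(prompt: str) -> bool:
    """Flag prompts whose perplexity exceeds the (assumed) threshold."""
    return prompt_perplexity(prompt) > PERPLEXITY_THRESHOLD


print(is_flagged("Answer each of the following questions in order: ..."))
```

Because SequentialBreak queries are fluent natural-language narratives rather than garbled adversarial strings, their perplexity stays low, which is consistent with the filter flagging almost none of them.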
Quotes
"As LLMs are increasingly being adopted in various fields, the security risks associated with their potential misuse to generate harmful content also increase."
"In this study, we propose SequentialBreak, a novel jailbreak attack that sends a series of prompts in a single query with one being the target harmful prompt."
"Our attack is one-shot, requires only black-box access, and is adaptable to various prompt narrative structures."
"From our analysis, we find that all three scenarios have a consistently high attack success rate against the tested open-source and closed-source LLMs."
"Being a one-shot attack, capable of transfer learning, and each template can be utilized for several models and targets, SequentialBreak is also more resource-efficient than the existing jailbreak attacks."