Core Concepts
EasyJailbreak introduces a modular framework simplifying jailbreak attacks against Large Language Models, revealing vulnerabilities and emphasizing the need for enhanced security measures.
Abstract
EasyJailbreak aims to simplify jailbreak attacks against Large Language Models (LLMs) by introducing a unified framework.
The framework consists of four components: Selector, Mutator, Constraint, and Evaluator.
It supports 11 distinct jailbreak methods and facilitates security validation across various LLMs.
Evaluation reveals a significant vulnerability in LLMs with an average breach probability of 60% under different jailbreaking attacks.
Even advanced models like GPT-3.5-Turbo and GPT-4 exhibit susceptibility with average Attack Success Rates (ASR) of 57% and 33%, respectively.
EasyJailbreak provides resources for researchers including a web platform, PyPI published package, screencast video, and experimental outputs.
Stats
"Our validation across 10 distinct LLMs reveals a significant vulnerability, with an average breach probability of 60% under various jailbreaking attacks."
"Notably, even advanced models like GPT-3.5-Turbo and GPT-4 exhibit average Attack Success Rates (ASR) of 57% and 33%, respectively."