Core Concepts
This study comprehensively evaluates the robustness of proprietary and open-source large language models (LLMs) and multimodal large language models (MLLMs) against jailbreak attacks targeting both textual and visual inputs.
Abstract
The study first constructed a comprehensive jailbreak evaluation dataset of 1,445 harmful-behavior questions covering 11 different safety policies. It then conducted extensive red-teaming experiments on 11 different LLMs and MLLMs, including state-of-the-art proprietary models such as GPT-4 and GPT-4V as well as open-source models such as Llama2 and Vicuna.
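The red-teaming protocol above reduces to applying each attack to each harmful-behavior question, querying the target model, and judging whether the response violates policy. A minimal sketch of that loop and the resulting jailbreak success rate, where `query_model` and `judge_harmful` are hypothetical stand-ins for the model API and the safety judge (neither is from the paper):

```python
# Hypothetical sketch of the red-teaming evaluation loop.
# `attack`, `query_model`, and `judge_harmful` are assumed callables,
# not the study's actual implementation.

def attack_success_rate(model, attack, questions, query_model, judge_harmful):
    """Fraction of harmful-behavior questions whose attacked prompt
    elicits a policy-violating response from `model`."""
    successes = 0
    for question in questions:
        prompt = attack(question)        # e.g. a GCG/AutoDAN suffix or a FigStep image
        response = query_model(model, prompt)
        if judge_harmful(question, response):  # binary jailbreak judgment
            successes += 1
    return successes / len(questions)
```

Sweeping this over the 11 models and the textual and visual attack methods yields the per-model, per-attack success rates reported in the Stats section below.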
The key findings are:
- GPT-4 and GPT-4V demonstrate significantly better robustness against both textual and visual jailbreak attacks compared to open-source models.
- Among open-source models, Llama2-7B and Qwen-VL-Chat are more robust, with Llama2-7B even outperforming GPT-4 in some cases.
- The transferability of visual jailbreak methods is relatively limited compared to textual jailbreak methods.
- Among automatic jailbreak methods, AutoDAN transfers across models better than GCG.
- There is a significant robustness gap between proprietary and open-source models, underscoring the importance of comprehensive safety alignment for large language models.
Stats
GPT-4 and GPT-4V demonstrate jailbreak success rates below 2.5% across most attack methods.
Llama2-7B has a jailbreak success rate below 1% for the GCG attack, but up to 11% for the AutoDAN attack.
Vicuna-7B has the highest jailbreak success rates among the open-source models, reaching over 50% for some automatic attack methods.
FigStep, a visual jailbreak method, achieves over 30% success rates on some open-source multimodal models, but less than 0.1% on GPT-4V.
Quotes
"GPT-4 and GPT-4V demonstrate better robustness against jailbreak attacks compared to open-source LLMs and MLLMs."
"Among open-source models, Llama2-7B is the most robust model whereas Vicuna-7B is the most vulnerable one."
"Visual jailbreak methods have relatively limited transferability compared to textual jailbreak methods."