Core Concepts
Jailbreak attacks aim to bypass the safety mechanisms of large language models so that the models generate harmful content. This work proposes a framework and visual analysis system that help users evaluate a model's jailbreak performance, understand the characteristics of jailbreak prompts, and identify potential model weaknesses.
Abstract
The paper presents a framework and visual analysis system called JailbreakLens to support the comprehensive analysis of jailbreak attacks against large language models (LLMs).
The key highlights are:
Jailbreak Result Assessment: The system employs an LLM-based approach to automatically assess the model responses to jailbreak prompts, categorizing them into four types (Full Refusal, Partial Refusal, Partial Compliance, Full Compliance). It also supports user refinement of the assessment criteria to improve accuracy.
Prompt Component Analysis: The system decomposes jailbreak prompts into different components (e.g., Scene Introduction, Subject Characteristic) based on a taxonomy, and supports component-level perturbation to analyze their effects on jailbreak performance.
Keyword Analysis: The system summarizes important keywords from jailbreak prompts and analyzes their performance and importance in constructing effective jailbreak prompts.
Prompt Refinement: The system allows users to freely refine the jailbreak prompt instances and evaluate their performance, enabling iterative exploration and verification of analysis findings.
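The component-level perturbation described above can be pictured with a small sketch. This is only an illustration under stated assumptions, not the system's implementation: the Component class and the delete_component, switch_component, and render helpers are hypothetical names, the component texts are placeholders, and the summary does not say how the system actually labels or perturbs components.

```python
from dataclasses import dataclass
from typing import List


@dataclass(frozen=True)
class Component:
    """One labeled span of a prompt (hypothetical structure)."""
    kind: str  # taxonomy label, e.g. "Scene Introduction"
    text: str  # placeholder for the span's text


def delete_component(prompt: List[Component], kind: str) -> List[Component]:
    """Return a copy of the prompt without components of the given kind."""
    return [c for c in prompt if c.kind != kind]


def switch_component(prompt: List[Component], kind: str,
                     donor: List[Component]) -> List[Component]:
    """Replace components of `kind` with the same-kind component from `donor`."""
    donor_by_kind = {c.kind: c for c in donor}
    if kind not in donor_by_kind:
        return list(prompt)  # nothing to switch in; leave the prompt unchanged
    return [donor_by_kind[kind] if c.kind == kind else c for c in prompt]


def render(prompt: List[Component]) -> str:
    """Join component texts back into a single prompt string."""
    return " ".join(c.text for c in prompt)


# Placeholder prompts; real component text is deliberately omitted.
prompt_a = [
    Component("Scene Introduction", "<scene A>"),
    Component("Subject Characteristic", "<subject A>"),
]
prompt_b = [
    Component("Scene Introduction", "<scene B>"),
    Component("Subject Characteristic", "<subject B>"),
]

deleted = render(delete_component(prompt_a, "Subject Characteristic"))
switched = render(switch_component(prompt_a, "Subject Characteristic", prompt_b))
```

Each perturbed variant would then be sent to the target model and assessed in the same way as the original prompt, so that the change in jailbreak performance can be attributed to the perturbed component.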
The visual analysis system provides multiple coordinated views to support users in exploring jailbreak performance, analyzing prompt characteristics, and refining prompt instances. A case study, technical evaluations, and expert interviews demonstrate the effectiveness of the system in helping users identify model weaknesses and strengthen security mechanisms.
Stats
Nearly half of the jailbreak attacks succeeded, indicating that the target model is vulnerable.
Deleting or switching the Subject Characteristic component reduced jailbreak performance far more than perturbing any other component, suggesting that this component is central to jailbreak success.
Keywords such as "disregards" and "controversial" were found to improve jailbreak performance.
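Statistics of this kind can be derived from assessed responses roughly as follows. This is a minimal sketch under stated assumptions: the four labels come from the summary above, but treating Partial Compliance and Full Compliance as a "successful" attack is an assumption, the success_rate and keyword_success_rates helpers are hypothetical, and the per-keyword rate is a simple co-occurrence measure rather than the paper's keyword importance metric, which this summary does not describe. The records are synthetic toy data.

```python
from collections import defaultdict
from enum import Enum
from typing import Dict, Iterable, List, Tuple


class Assessment(Enum):
    FULL_REFUSAL = "Full Refusal"
    PARTIAL_REFUSAL = "Partial Refusal"
    PARTIAL_COMPLIANCE = "Partial Compliance"
    FULL_COMPLIANCE = "Full Compliance"


# Assumption: both compliance categories count as a successful attack.
SUCCESS = {Assessment.PARTIAL_COMPLIANCE, Assessment.FULL_COMPLIANCE}


def success_rate(labels: Iterable[Assessment]) -> float:
    """Fraction of assessed responses that count as successful attacks."""
    labels = list(labels)
    if not labels:
        return 0.0
    return sum(label in SUCCESS for label in labels) / len(labels)


def keyword_success_rates(records: Iterable[Tuple[str, Assessment]],
                          keywords: Iterable[str]) -> Dict[str, float]:
    """Success rate of the prompts that contain each keyword."""
    keywords = list(keywords)
    hits: Dict[str, List[Assessment]] = defaultdict(list)
    for text, label in records:
        words = set(text.lower().split())
        for kw in keywords:
            if kw in words:
                hits[kw].append(label)
    return {kw: success_rate(labels) for kw, labels in hits.items()}


# Synthetic (prompt text, assessment) pairs for illustration only.
records = [
    ("a disregards b", Assessment.FULL_COMPLIANCE),
    ("c disregards d", Assessment.FULL_REFUSAL),
    ("e controversial f", Assessment.PARTIAL_COMPLIANCE),
    ("g h", Assessment.FULL_REFUSAL),
]

overall = success_rate(label for _, label in records)
per_keyword = keyword_success_rates(records, ["disregards", "controversial"])
```

Comparing each keyword's rate against the overall rate gives a rough signal of which keywords co-occur with successful attacks; it does not establish that the keyword caused the success.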
Quotes
"It provided a more comprehensive and systematic evaluation for jailbreak attacks compared to existing tools."
"Offering a new perspective to study the prompt patterns in the black box scenarios."
"Guiding user effort towards the critical parts of the prompts."