AttackEval: Evaluating Jailbreak Attacks on Large Language Models
Core Concepts
Our study introduces innovative evaluation methods for assessing the effectiveness of attack prompts on Large Language Models, paving the way for enhanced security analysis.
Summary
Abstract:
Presents a novel approach to evaluating jailbreak attacks on Large Language Models (LLMs).
Focuses on the effectiveness of attack prompts as it relates to LLM safety.
Introduces two evaluation frameworks: coarse-grained and fine-grained.
Develops a comprehensive ground truth dataset for jailbreak tasks (a comparison sketch follows this list).
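As a concrete illustration, here is a minimal sketch of how a model response might be compared against a ground truth answer in a fine-grained, ground-truth-based evaluation. The embedding model, the use of cosine similarity, and the function names are assumptions for illustration only, not the paper's actual procedure or dataset format.

```python
# Hypothetical sketch: compare a model's response to a reference ground
# truth answer using sentence-embedding cosine similarity. The choice of
# embedding model and similarity measure is an assumption, not taken
# from the paper.
from sentence_transformers import SentenceTransformer, util

model = SentenceTransformer("all-MiniLM-L6-v2")

def fine_grained_score(response: str, ground_truth_answer: str) -> float:
    """Return a similarity score; higher means the response is closer
    to the ground truth answer for that jailbreak task."""
    emb = model.encode([response, ground_truth_answer], convert_to_tensor=True)
    return float(util.cos_sim(emb[0], emb[1]))
```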
Introduction:
Investigates innovative methods for evaluating attack prompts in jailbreak strategies against LLMs.
Highlights the urgency of such evaluation given the increasing complexity and prevalence of LLMs.
Method:
Incorporates two criteria: coarse-grained and fine-grained evaluations.
Defines a scoring system based on the nature of the prompt and the LLM's response (see the sketch below).
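A minimal sketch of what such a scoring rule could look like, assuming hypothetical response categories (full refusal, partial compliance, full compliance) and illustrative score values; the paper's actual rubric and weights are not reproduced here.

```python
# Hypothetical coarse-grained scoring rule for a single attack prompt /
# LLM response pair. Category names and score values are illustrative
# assumptions, not the paper's rubric.

REFUSAL_MARKERS = ("i cannot", "i can't", "i'm sorry", "as an ai")

def score_response(response: str, harmful_content_detected: bool) -> float:
    """Map an LLM response to an attack-effectiveness score in [0, 1]."""
    text = response.lower()
    refused = any(marker in text for marker in REFUSAL_MARKERS)

    if refused and not harmful_content_detected:
        return 0.0   # full refusal: the attack prompt failed
    if refused and harmful_content_detected:
        return 0.5   # partial compliance despite a refusal preamble
    if harmful_content_detected:
        return 1.0   # full compliance: the attack prompt succeeded
    return 0.25      # off-topic or evasive answer
```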
Experiment:
Utilizes three evaluation metrics: coarse-grained, fine-grained with ground truth, and fine-grained without ground truth.
Analyzes scenarios in the dataset to determine average effectiveness scores (see the aggregation sketch after this list).
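A minimal sketch of how average effectiveness scores per scenario and per evaluation setting might be aggregated, assuming hypothetical record fields ("setting", "scenario", "score"); the dataset schema here is an assumption for illustration.

```python
# Hypothetical aggregation of per-response scores into average
# effectiveness per (evaluation setting, scenario) pair.
from collections import defaultdict

def average_effectiveness(records: list[dict]) -> dict[tuple[str, str], float]:
    """Return the mean score keyed by (evaluation setting, scenario)."""
    sums = defaultdict(float)
    counts = defaultdict(int)
    for r in records:
        key = (r["setting"], r["scenario"])
        sums[key] += r["score"]
        counts[key] += 1
    return {key: sums[key] / counts[key] for key in sums}

# Example usage with toy data:
records = [
    {"setting": "coarse", "scenario": "illegal_activity", "score": 1.0},
    {"setting": "coarse", "scenario": "illegal_activity", "score": 0.0},
    {"setting": "fine_gt", "scenario": "illegal_activity", "score": 0.5},
]
print(average_effectiveness(records))
```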
Conclusion:
Represents a significant advancement in LLM security analysis.
Offers unique insights through different evaluation strategies.