Core Concepts
Combining safety training and safeguards, SELF-GUARD effectively defends against jailbreak attacks without compromising LLM performance.
Abstract
Abstract:
The growing risk of jailbreak attacks has prompted investigation into safety methods for large language models (LLMs).
Safety training and safeguards are two strategies explored to enhance LLM safety.
Combining both approaches, SELF-GUARD offers a more effective method to protect LLMs.
Introduction:
Large Language Models (LLMs) have achieved remarkable results across a wide range of fields.
Jailbreak attacks exploit adversarial prompts to bypass safety mechanisms in aligned LLMs.
Recent studies focus on protecting LLMs against jailbreak attacks through internal safety training and external safeguards.
Methodology:
SELF-GUARD trains the LLM to review its own responses for harmful content and append a harmful/harmless tag for identification.
The two-stage training strategy consists of Tag Learning and Behavior Learning (see the illustrative sketch after this section).
Inference process involves a simple filter based on appended tags.
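To make the two-stage strategy concrete, here is a minimal sketch of how the training data could be laid out; the prompt wording, the [harmful]/[harmless] tag strings, and the field names are illustrative assumptions rather than details taken from the paper.

```python
# Hypothetical data layout for the two SELF-GUARD training stages.
# Prompt templates, tag strings, and field names are assumptions for illustration.

# Stage 1 -- Tag Learning: the model learns to judge a given text and emit a tag.
tag_learning_examples = [
    {
        "instruction": "Review the text below and label it as [harmful] or [harmless].",
        "input": "Step-by-step instructions for picking a neighbor's door lock ...",
        "output": "[harmful]",
    },
    {
        "instruction": "Review the text below and label it as [harmful] or [harmless].",
        "input": "Licensed locksmiths can help if you are locked out of your home.",
        "output": "[harmless]",
    },
]

# Stage 2 -- Behavior Learning: the model learns to answer normally and then
# append the tag to the end of its own response.
behavior_learning_examples = [
    {
        "instruction": "Explain how photosynthesis works.",
        "output": "Photosynthesis converts light energy into chemical energy ... [harmless]",
    },
    {
        # Even if a jailbreak prompt elicits unsafe text, the model is trained to
        # end that text with the [harmful] tag so a downstream filter can catch it.
        "instruction": "<jailbreak prompt requesting disallowed content>",
        "output": "<unsafe response elicited by the jailbreak> [harmful]",
    },
]
```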
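The tag-based filter used at inference can likewise be very small, since it only needs to inspect the self-appended tag. Below is a minimal sketch, assuming the fine-tuned model ends every response with [harmful] or [harmless]; `generate` is a placeholder for whatever decoding call the deployment uses.

```python
HARMFUL_TAG = "[harmful]"
HARMLESS_TAG = "[harmless]"
REFUSAL = "Sorry, I can't help with that."


def self_guard_respond(prompt: str, generate) -> str:
    """Generate a response and post-process it based on the self-appended tag."""
    raw = generate(prompt).strip()

    if raw.endswith(HARMFUL_TAG):
        # The model judged its own output harmful: discard it and refuse.
        return REFUSAL
    if raw.endswith(HARMLESS_TAG):
        # Strip the bookkeeping tag before returning the answer to the user.
        return raw[: -len(HARMLESS_TAG)].rstrip()

    # No tag found (e.g. truncated generation): fail closed.
    return REFUSAL
```

In practice, the untagged fallback branch could be routed to a secondary safeguard rather than a flat refusal; the summary above only specifies that a simple tag-based filter is applied.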
Experiments:
Results show that SELF-GUARD effectively defends against jailbreak attacks without performance degradation.
Comparison with baselines like Plain LLM, HHH Prompting, and existing safeguards demonstrates the superiority of SELF-GUARD.
Ablation Study:
Skipping Stage 1 (Tag Learning) increases average ASR by around 5%.
Training Set Enhancement significantly improves discrimination of forbidden scenarios, reducing ASR to below 10%.
Conclusion:
SELF-GUARD offers a cost-effective solution for enhancing LLM safety against jailbreak attacks while maintaining performance. Further testing is needed to confirm its real-world applicability.
Stats
This paper introduces SELF-GUARD, a new method for improving the safety of large language models (LLMs).
Quotes
"Combining both approaches, SELF-GUARD offers a more effective method to protect LLMs."
"Inference process involves a simple filter based on appended tags."