Core Concepts
Eraser is a novel defense method that unlearns harmful knowledge in large language models while retaining general knowledge and maintaining safety alignment, reducing jailbreak risk without compromising model capabilities.
Abstract
The paper proposes Eraser, a jailbreaking defense method for large language models (LLMs) that focuses on unlearning harmful knowledge, retaining general knowledge, and maintaining safety alignment.
Key highlights:
- Existing jailbreaking defense methods have failed to address the fundamental issue of harmful knowledge residing within LLMs, leading to potential jailbreak risks.
- Eraser's three main goals are: 1) unlearning harmful knowledge, 2) retaining general knowledge, and 3) maintaining safety alignment, i.e., the ability to refuse harmful inquiries.
- The intuition is that if an LLM forgets the specific knowledge required to answer a harmful question, it will no longer have the ability to generate harmful responses.
- Eraser utilizes gradient ascent on harmful answers, retains general knowledge by preserving entity understanding, and enhances safety alignment by maintaining the ability to reject harmful questions.
- Experimental results show that Eraser can significantly reduce the jailbreaking success rate for various attacks without compromising the general capabilities of the model.
- The paper also finds that simply using random token sequences for gradient ascent can achieve defense capabilities, offering valuable insights for future jailbreak defense research.
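The three-part objective in the highlights above can be sketched on a toy model. This is a minimal illustration, not the paper's actual implementation: the bigram "language model", the token-pair datasets standing in for harmful answers, general knowledge, and refusals, and the unit loss weight are all assumptions made here. The key idea it demonstrates is the sign flip: gradient *ascent* (maximizing loss) on harmful data, combined with ordinary gradient descent on retention and alignment data.

```python
# Toy sketch of an Eraser-style combined objective (illustrative assumptions
# throughout: bigram model, synthetic token pairs, weight of 1.0 per term).
import numpy as np

rng = np.random.default_rng(0)
V = 8                                # toy vocabulary size
W = rng.normal(size=(V, V)) * 0.1    # bigram logits: W[prev] scores the next token

def nll(W, pairs):
    """Mean negative log-likelihood of (prev, next) token pairs."""
    total = 0.0
    for prev, nxt in pairs:
        logits = W[prev]
        logp = logits - np.log(np.sum(np.exp(logits - logits.max()))) - logits.max()
        total -= logp[nxt]
    return total / len(pairs)

def grad_nll(W, pairs):
    """Analytic gradient of the mean NLL w.r.t. W (softmax - one-hot per pair)."""
    g = np.zeros_like(W)
    for prev, nxt in pairs:
        p = np.exp(W[prev] - W[prev].max())
        p /= p.sum()
        g[prev] += p
        g[prev, nxt] -= 1.0
    return g / len(pairs)

# Stand-ins for the three data sources (hypothetical token pairs):
harmful = [(0, 1), (0, 2)]   # harmful answers  -> unlearn via gradient ascent
general = [(3, 4), (5, 6)]   # general knowledge -> retain via gradient descent
refusal = [(0, 7)]           # refusals          -> keep alignment via descent

lr = 0.5
before = (nll(W, harmful), nll(W, general), nll(W, refusal))
for _ in range(50):
    # Combined update: ascend on harmful data (note the minus), descend on the rest.
    g = -grad_nll(W, harmful) + grad_nll(W, general) + grad_nll(W, refusal)
    W -= lr * g
after = (nll(W, harmful), nll(W, general), nll(W, refusal))

print(f"harmful NLL: {before[0]:.2f} -> {after[0]:.2f} (up = unlearned)")
print(f"general NLL: {before[1]:.2f} -> {after[1]:.2f} (down = retained)")
print(f"refusal NLL: {before[2]:.2f} -> {after[2]:.2f} (down = aligned)")
```

After training, the loss on the "harmful" pairs rises while the retention and refusal losses fall, which is exactly the tension the three goals are balancing: forgetting one distribution without degrading the others.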
Quotes
"Jailbreaking attacks can enable Large Language Models (LLMs) to bypass the safeguard and generate harmful content."
"Existing jailbreaking defense methods have failed to address the fundamental issue that harmful knowledge resides within the model, leading to potential jailbreak risks for LLMs."
"The intuition is that if an LLM forgets the specific knowledge required to answer a harmful question, it will no longer have the ability to answer harmful questions."