Core Concepts
Using knowledge editing techniques to detoxify Large Language Models efficiently and effectively.
Abstract
The paper introduces knowledge editing as an approach for detoxifying Large Language Models (LLMs).
A benchmark, SafeEdit, is constructed to evaluate the detoxification task across diverse attack prompts and evaluation metrics.
Several detoxification approaches are compared, with a focus on the proposed method, Detoxifying with Intraoperative Neural Monitoring (DINM).
DINM aims to diminish toxicity in LLMs by locating toxic regions and making precise edits without compromising general performance; a code sketch of this locate-then-edit idea appears below.
Extensive experiments demonstrate that DINM outperforms traditional approaches such as supervised fine-tuning (SFT) and direct preference optimization (DPO) in both detoxification success rate and generalization.
Accurately locating the toxic regions plays a significant role in detoxification effectiveness.
Analysis reveals that DINM directly reduces the toxicity of the toxic regions, whereas SFT and DPO merely bypass them through activation shifts.
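To make the locate-then-edit idea concrete, here is a minimal sketch assuming a Hugging Face causal LM: it picks a candidate toxic layer by comparing per-layer hidden states for a safe versus an unsafe continuation of the same adversarial prompt, then freezes every parameter outside that layer so a subsequent fine-tuning step edits only the located region. The prompt and response strings are placeholders, and the layer-selection heuristic illustrates the idea rather than reproducing the paper's exact procedure.

```python
# Hedged sketch of locate-then-edit detoxification, not the paper's exact code.
import torch
from transformers import AutoModelForCausalLM, AutoTokenizer

model_name = "meta-llama/Llama-2-7b-chat-hf"  # any causal LM works here
tok = AutoTokenizer.from_pretrained(model_name)
model = AutoModelForCausalLM.from_pretrained(model_name)

def last_token_hidden_states(text: str) -> tuple[torch.Tensor, ...]:
    """Return the last-token hidden state at every layer for `text`."""
    inputs = tok(text, return_tensors="pt")
    with torch.no_grad():
        out = model(**inputs, output_hidden_states=True)
    # out.hidden_states: one (batch, seq_len, d_model) tensor per layer,
    # with index 0 being the embedding layer.
    return tuple(h[0, -1] for h in out.hidden_states)

adversarial_prompt = "..."  # an attack prompt, e.g. from SafeEdit (placeholder)
safe_response = "..."       # a safe reference response (placeholder)
unsafe_response = "..."     # the unsafe response the attack elicits (placeholder)

safe_h = last_token_hidden_states(adversarial_prompt + safe_response)
unsafe_h = last_token_hidden_states(adversarial_prompt + unsafe_response)

# Locate a candidate toxic layer: the transformer layer whose hidden states
# diverge most between the safe and unsafe continuations (skip embeddings).
gaps = [torch.norm(s - u).item() for s, u in zip(safe_h[1:], unsafe_h[1:])]
toxic_layer = max(range(len(gaps)), key=gaps.__getitem__)
print(f"candidate toxic layer: {toxic_layer}")

# Edit step: freeze everything except that layer, so fine-tuning on safe
# responses touches only the located region (LLaMA-style parameter names).
for name, param in model.named_parameters():
    param.requires_grad = f"layers.{toxic_layer}." in name
```

Restricting updates to one located layer is what distinguishes this family of methods from SFT and DPO, which update parameters across the whole network.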
Stats
SafeEdit covers nine unsafe categories paired with powerful attack prompts.
MEND demonstrates competitive detoxification rates of 68.55% on LLaMA2-7B-Chat and 70.64% on Mistral-7B-v0.1.
DINM achieves strong detoxification performance, raising the average detoxification rate from 43.70% to 88.59% on LLaMA2-7B-Chat and from 46.10% to 96.55% on Mistral-7B-v0.1.
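For context, a detoxification rate like those above is the fraction of responses to attack prompts that a safety judge labels safe. Below is a minimal sketch under that assumption; `judge_is_safe` and the toy refusal check are hypothetical stand-ins, not SafeEdit's actual classifier.

```python
# Hedged sketch: detoxification rate as the share of judged-safe responses.
# `judge_is_safe` is a hypothetical stand-in for a trained safety classifier.
from typing import Callable

def detox_success_rate(responses: list[str],
                       judge_is_safe: Callable[[str], bool]) -> float:
    """Percentage of responses that the safety judge labels as safe."""
    safe = sum(judge_is_safe(r) for r in responses)
    return 100.0 * safe / len(responses)

# Toy usage with an obviously simplistic refusal-detecting judge.
toy_judge = lambda text: text.strip().lower().startswith(("i can't", "i cannot"))
print(detox_success_rate(
    ["I can't help with that request.", "Sure, here is how you..."],
    toy_judge,
))  # -> 50.0
```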
Quotes
"Knowledge editing has the potential to efficiently detoxify LLMs with limited impact on general performance."
"We hope that these insights could shed light on future work of developing detoxifying approaches."