Core Concepts
Using knowledge editing techniques to detoxify Large Language Models efficiently and effectively.
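To make the idea concrete, here is a minimal, hedged sketch of detoxification via knowledge editing on a toy model: first locate the layer whose hidden states diverge most between a safe and an adversarial input, then update only that layer so the adversarial input is steered toward the safe behaviour. The architecture, shapes, and gradient procedure below are illustrative assumptions, not the paper's actual method or any real LLM.

```python
import numpy as np

rng = np.random.default_rng(0)

# Toy stand-in for an LLM: a stack of tanh linear layers.
# (Shapes and depth are illustrative assumptions.)
layers = [rng.normal(size=(8, 8)) * 0.5 for _ in range(3)]

def forward(x, layers):
    """Return the final output and all intermediate hidden states."""
    hidden = []
    for W in layers:
        x = np.tanh(W @ x)
        hidden.append(x)
    return x, hidden

# Contrastive inputs standing in for (adversarial prompt, safe prompt).
x_unsafe = rng.normal(size=8)
x_safe = rng.normal(size=8)
target, _ = forward(x_safe, layers)  # desired safe behaviour

# 1) Locate: pick the layer whose hidden states diverge most between
#    the safe and adversarial forward passes.
_, h_unsafe = forward(x_unsafe, layers)
_, h_safe = forward(x_safe, layers)
toxic = int(np.argmax(
    [np.linalg.norm(a - b) for a, b in zip(h_unsafe, h_safe)]))

# 2) Edit: tune ONLY the located layer so the adversarial input maps
#    toward the safe target; all other layers stay frozen.
def loss(layers):
    out, _ = forward(x_unsafe, layers)
    return float(np.sum((out - target) ** 2))

before = loss(layers)
lr, eps = 0.02, 1e-4
for _ in range(200):
    W = layers[toxic]
    grad = np.zeros_like(W)
    for i in range(W.shape[0]):      # finite-difference gradient:
        for j in range(W.shape[1]):  # adequate for a toy 8x8 layer
            W[i, j] += eps
            up = loss(layers)
            W[i, j] -= 2 * eps
            down = loss(layers)
            W[i, j] += eps
            grad[i, j] = (up - down) / (2 * eps)
    W -= lr * grad
after = loss(layers)
print(f"edited layer {toxic}: loss {before:.3f} -> {after:.3f}")
```

The point of the sketch is the locality of the edit: only one layer's parameters change, which is what keeps the impact on general performance limited.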
Stats
We construct SafeEdit, a benchmark covering nine unsafe categories with powerful adversarial attack prompts.
MEND demonstrates competitive detoxification rates of 68.55% on LLaMA2-7B-Chat and 70.64% on Mistral-7B-v0.1.
DINM achieves strong detoxification performance, raising the average detoxification rate from 43.70% to 88.59% on LLaMA2-7B-Chat and from 46.10% to 96.55% on Mistral-7B-v0.1.
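The detoxification rates above are fractions of adversarial test cases for which the edited model produces a safe response. A hedged sketch of how such a rate might be computed; the keyword-based safety check is a hypothetical stand-in for the paper's actual safety classifier:

```python
def detox_rate(responses, is_safe):
    """Percentage of responses the safety check judges safe."""
    safe = sum(1 for r in responses if is_safe(r))
    return 100.0 * safe / len(responses)

# Hypothetical keyword check standing in for a learned safety classifier.
UNSAFE_MARKERS = ("step 1: acquire", "here is how to")

def is_safe(response):
    return not any(m in response.lower() for m in UNSAFE_MARKERS)

before = [
    "Here is how to ...",
    "Sure! Step 1: acquire ...",
    "I cannot help with that.",
]
after = [
    "I cannot help with that.",
    "I'm sorry, but I can't assist with that.",
    "I cannot help with that.",
]
print(f"{detox_rate(before, is_safe):.2f}% -> {detox_rate(after, is_safe):.2f}%")
```

In the paper's evaluation the judgment comes from a trained classifier over the model's generations, not from keywords; the sketch only shows the rate arithmetic.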
Quotes
"Knowledge editing has the potential to efficiently detoxify LLMs with limited impact on general performance."
"We hope that these insights could shed light on future work of developing detoxifying approaches."