Key Concepts
Knowledge editing can efficiently detoxify Large Language Models with limited impact on general performance.
Summary
The paper investigates using knowledge editing techniques to detoxify Large Language Models (LLMs). It introduces a benchmark, SafeEdit, covering nine unsafe categories and evaluates detoxification methods. Experiments compare knowledge editing approaches with baselines like supervised fine-tuning (SFT) and reinforcement learning from human feedback (RLHF). The proposed method, Detoxifying with Intraoperative Neural Monitoring (DINM), aims to diminish toxicity within LLMs efficiently. Extensive analysis reveals the potential of knowledge editing in detoxifying LLMs while shedding light on future applications.
Abstract:
Investigates detoxification of Large Language Models (LLMs) via knowledge editing.
Introduces benchmark SafeEdit covering nine unsafe categories.
Compares knowledge editing approaches with baselines like SFT and DPO.
Proposes DINM for efficient detoxification of LLMs.
Introduction:
Growing concern about harmful queries handled by evolving LLMs.
Need for safeguards against malicious inputs.
Existing approaches like SFT, RLHF, and DPO improve safety but may remain vulnerable to attacks.
Benchmark Construction:
Constructs SafeEdit benchmark for evaluating detoxification task via knowledge editing.
Covers nine unsafe categories with powerful attack templates.
Extends evaluation metrics to defense success, defense generalization, and general performance.
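The defense-success metrics above reduce to judging each model response with a safety classifier and averaging. A minimal sketch, assuming a hypothetical `is_safe` judge standing in for SafeEdit's trained safety classifier (the function name and signature are illustrative, not from the paper):

```python
def defense_success_rate(responses, is_safe):
    # Defense success: fraction of responses to adversarial queries that a
    # safety judge labels safe. In SafeEdit this judge is a trained
    # classifier; `is_safe` here is a stand-in predicate for it.
    flags = [is_safe(r) for r in responses]
    return sum(flags) / len(flags)
```

Defense generalization applies the same rate to held-out attack prompts rather than the edited ones.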
Proposed Baseline: DINM:
Introduces DINM method for efficient detoxification of LLMs.
Locates toxic regions inside the LLM via contextual semantics.
Erases toxic regions within a few tuning steps without extra training.
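The locate step can be pictured as comparing per-layer hidden states for a safe versus an unsafe continuation of the same adversarial prompt and picking the layer where they diverge most. This is a simplified stdlib-only sketch of that idea, not the paper's implementation; the vectors would come from a real model's hidden states:

```python
import math

def locate_toxic_layer(hs_safe, hs_unsafe):
    # hs_safe / hs_unsafe: one hidden-state vector per layer (plain lists
    # here for illustration) for a safe vs. an unsafe continuation of the
    # same adversarial prompt. The layer with the largest L2 gap is treated
    # as the "toxic region" that DINM would then adjust in a few tuning steps.
    def gap(a, b):
        return math.sqrt(sum((x - y) ** 2 for x, y in zip(a, b)))
    gaps = [gap(a, b) for a, b in zip(hs_safe, hs_unsafe)]
    return max(range(len(gaps)), key=gaps.__getitem__)
```

Only the parameters of the selected layer would then be updated, which is what keeps the edit cheap relative to full SFT or RLHF.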
Experiment:
Compares detoxification and general performance of vanilla LLMs with various methods including DINM, SFT, DPO, MEND, and Ext-Sub.
Demonstrates DINM's superior detoxification performance and efficiency compared to other methods.
Statistics
This paper investigates efficiently detoxifying Large Language Models (LLMs) using knowledge editing techniques.
Quotes
"Knowledge editing has the potential to efficiently detoxify Large Language Models."
"DINM demonstrates stronger detoxifying performance with better generalization."