The article discusses a technique called "abliteration" that can uncensor a large language model (LLM) without retraining. Modern LLMs are fine-tuned for safety and instruction-following, which trains them to refuse harmful requests. The author explains that this refusal behavior is mediated by a single direction in the model's residual stream, and that by preventing the model from representing this direction, abliteration removes its ability to refuse requests.
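The core operation can be sketched numerically: subtracting an activation's component along the refusal direction leaves a vector orthogonal to it. This is a minimal illustration with random stand-in data, not the article's actual model activations.

```python
import numpy as np

rng = np.random.default_rng(0)
d_model = 16  # hypothetical residual-stream width

# Stand-in refusal direction, normalized to a unit vector r_hat.
refusal_dir = rng.normal(size=d_model)
refusal_dir /= np.linalg.norm(refusal_dir)

# Stand-in residual-stream activation at some layer.
activation = rng.normal(size=d_model)

# Ablation: a' = a - (a . r_hat) r_hat
# removes the component of the activation along the refusal direction.
ablated = activation - (activation @ refusal_dir) * refusal_dir

# The ablated activation has no component along the refusal direction.
print(abs(ablated @ refusal_dir) < 1e-10)  # True
```

Because the projection is removed exactly, the model can no longer "read" the refusal signal from this activation, while all orthogonal components are left untouched.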
The article provides a step-by-step implementation of the abliteration process, which involves the following key steps:
1. Data collection: run the model on sets of harmful and harmless instructions and record the residual-stream activations.
2. Refusal direction: compute the difference between the mean harmful and mean harmless activations, normalize it, and select the best-performing candidate direction.
3. Ablation: remove this direction, either by intervening on activations at inference time or by orthogonalizing the model's weights against it.
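The mean-difference step can be sketched as follows. The arrays here are random placeholders for the residual-stream activations that the real pipeline would collect from the model on harmful and harmless instruction sets.

```python
import numpy as np

rng = np.random.default_rng(0)
n_samples, d_model = 32, 16  # hypothetical sizes

# Stand-ins for cached residual-stream activations at one layer.
# A constant offset simulates a systematic "refusal" signal in the
# harmful-prompt activations.
harmful_acts = rng.normal(size=(n_samples, d_model)) + 1.0
harmless_acts = rng.normal(size=(n_samples, d_model))

# Refusal direction: normalized difference of the mean activations.
diff = harmful_acts.mean(axis=0) - harmless_acts.mean(axis=0)
refusal_dir = diff / np.linalg.norm(diff)

print(refusal_dir.shape)  # (16,)
```

In practice this computation is repeated per layer (and per token position), and the candidate direction that best suppresses refusals is selected.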
The author then demonstrates abliteration with weight orthogonalization, applying it to the Daredevil-8B model using the TransformerLens library. The article also examines the performance impact of the ablation and uses Direct Preference Optimization (DPO) fine-tuning to recover the model's quality afterward.
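Weight orthogonalization makes the ablation permanent: instead of editing activations at inference time, each weight matrix that writes into the residual stream is modified so its outputs can never contain the refusal direction. A minimal sketch with a random stand-in matrix (the real method applies this to the model's embedding, attention-output, and MLP-output weights):

```python
import numpy as np

rng = np.random.default_rng(0)
d_in, d_model = 8, 16  # hypothetical layer dimensions

# Stand-in for a weight matrix that writes into the residual stream.
W = rng.normal(size=(d_in, d_model))

# Stand-in unit refusal direction r_hat.
refusal_dir = rng.normal(size=d_model)
refusal_dir /= np.linalg.norm(refusal_dir)

# Orthogonalization: W' = W - (W r_hat) r_hat^T
# subtracts each row's component along the refusal direction.
W_orth = W - np.outer(W @ refusal_dir, refusal_dir)

# Any output of the modified layer is orthogonal to the refusal direction.
x = rng.normal(size=d_in)
print(abs((x @ W_orth) @ refusal_dir) < 1e-8)  # True
```

Because the projection is baked into the weights, the resulting checkpoint needs no runtime hooks and can be saved and shared like any other model.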
Key ideas from the article by Maxime Labonne at medium.com, 06-13-2024
https://medium.com/@mlabonne/uncensor-any-llm-with-abliteration-d30148b7d43e