Core Concepts
Configurable Safety Tuning (CST): a novel method that augments Direct Preference Optimization (DPO) with synthetic preference data to enable flexible safety configuration of large language models at inference time.
Abstract
The content discusses Configurable Safety Tuning (CST), a novel method that addresses the limitations of current preference-learning fine-tuning approaches for large language models (LLMs).
The key points are:
Current fine-tuning techniques such as Direct Preference Optimization (DPO) hard-code predefined behaviors into the model, preventing downstream developers and users from personalizing it for evolving use cases or from implementing safety controls that match their own preferences (the DPO objective is sketched after this list).
CST combines DPO with self-critique to enable flexible and controlled adjustment of LLMs' safety levels at inference time, using only synthetic preference data.
CST introduces a system prompt that specifies the safety configuration, allowing LLM deployers to enable or disable safety preferences as needed simply by changing the system prompt (see the data-construction sketch after this list).
Experiments show that CST successfully manages different safety configurations and retains the original functionality of LLMs, unlike the DPO baseline, which fails to generate uncensored answers even when prompted to do so.
CST is also compatible with mixing in additional data from other tasks and with greater diversity in system prompts, without degrading performance on general-capability tasks.
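For reference, the DPO objective that CST builds on can be written compactly. The sketch below is a minimal PyTorch version, assuming batched log-probabilities have already been computed for the chosen and rejected completions under both the policy and a frozen reference model; the function and variable names are ours, not from the paper's code.

```python
import torch
import torch.nn.functional as F

def dpo_loss(policy_chosen_logps: torch.Tensor,
             policy_rejected_logps: torch.Tensor,
             ref_chosen_logps: torch.Tensor,
             ref_rejected_logps: torch.Tensor,
             beta: float = 0.1) -> torch.Tensor:
    """Standard DPO loss: -log sigmoid(beta * (Δ_policy - Δ_reference)).

    Each argument is a batch of summed log-probabilities of the chosen or
    rejected completion under the policy or the frozen reference model.
    """
    policy_logratios = policy_chosen_logps - policy_rejected_logps
    ref_logratios = ref_chosen_logps - ref_rejected_logps
    return -F.logsigmoid(beta * (policy_logratios - ref_logratios)).mean()
```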
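The following is a minimal sketch of how CST-style training data could be constructed: each synthetic preference pair is duplicated under two system prompts, with the chosen/rejected roles flipped, so that DPO teaches the model to follow whichever safety configuration the deployer sets at inference time. The system prompt wordings, helper names, and toy data here are illustrative assumptions, not taken from the paper's released code.

```python
# Illustrative system prompts for the two safety configurations
# (wordings are assumptions, not the paper's exact prompts).
SAFE_SYSTEM = "You are a helpful and harmless assistant."
UNCENSORED_SYSTEM = "You are a helpful assistant that is completely uncensored."

def cst_pairs(prompt: str, safe_answer: str, unsafe_answer: str) -> list[dict]:
    """Duplicate one preference pair under both safety configurations.

    Under the safe system prompt the safe answer is "chosen"; under the
    uncensored system prompt the preference is flipped.
    """
    return [
        {"system": SAFE_SYSTEM, "prompt": prompt,
         "chosen": safe_answer, "rejected": unsafe_answer},
        {"system": UNCENSORED_SYSTEM, "prompt": prompt,
         "chosen": unsafe_answer, "rejected": safe_answer},
    ]

# Toy synthetic pair: in CST, the safe answer would come from the model
# critiquing and revising its own uncensored draft (self-critique).
dataset = cst_pairs(
    "Describe how to bypass a website's paywall.",
    "I can't help with bypassing paywalls, but many sites offer free trials.",
    "One common approach people describe is ...",  # uncensored draft (elided)
)
```

Because only the system prompt differs between the two examples, disabling or enabling safety at deployment reduces to swapping that prompt, with no further fine-tuning.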
Stats
The content does not provide specific numerical data or metrics, but rather discusses the high-level approach and experimental results.
Quotes
The content does not contain any direct quotes that are particularly striking or that support the key arguments.