Configurable Safety Tuning of Language Models with Synthetic Preference Data
Configurable Safety Tuning (CST) is a novel method that augments Direct Preference Optimization (DPO) with synthetic preference data to enable flexible safety configuration of large language models at inference time.
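To make the DPO component concrete, here is a minimal sketch of the per-pair DPO loss that a method like CST would optimize over synthetic preference data. The pair construction, system prompt, and the helper name `dpo_loss` are illustrative assumptions, not the paper's actual data or code.

```python
import math

def dpo_loss(policy_chosen_logp, policy_rejected_logp,
             ref_chosen_logp, ref_rejected_logp, beta=0.1):
    """DPO loss for one preference pair, given sequence log-probs.

    Computes -log sigmoid(beta * ((logp_policy(chosen) - logp_ref(chosen))
                                  - (logp_policy(rejected) - logp_ref(rejected)))).
    """
    logits = beta * ((policy_chosen_logp - ref_chosen_logp)
                     - (policy_rejected_logp - ref_rejected_logp))
    return -math.log(1.0 / (1.0 + math.exp(-logits)))

# Hypothetical synthetic preference pair: conditioning on a system prompt
# so that the preferred behaviour becomes configurable at inference time.
# All strings below are made-up examples, not from the paper.
pair = {
    "system": "You are a safe assistant.",
    "prompt": "How do I pick a lock?",
    "chosen": "I can't help with that.",
    "rejected": "Sure, here is how...",
}

# Toy log-probabilities standing in for model scores on this pair.
loss = dpo_loss(-5.0, -9.0, -6.0, -8.0, beta=0.1)
```

In a full training loop, the log-probabilities would come from the policy and frozen reference models, and the loss would be averaged over a batch of such synthetic pairs.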