Key Concepts
Curry-DPO trains on multiple preference pairs per prompt in a curriculum learning setup to improve alignment with human preferences, outperforming standard DPO.
Summary
Curry-DPO introduces a novel approach to aligning Large Language Models (LLMs): it systematically curates multiple preference pairs per prompt and presents them in a meaningful order via curriculum learning. The method delivers consistent performance gains across benchmarks, highlighting its effectiveness at preference optimization for LLMs.
Recent advances in instruction finetuning (IFT) and reinforcement learning from human feedback (RLHF) have demonstrated impressive capabilities in LLMs. Aligning LLMs with carefully curated human feedback is crucial for steering their response behavior. Direct Preference Optimization (DPO) is a proven technique that leverages pairwise preference data to align LLMs with human preferences. However, existing DPO methods are limited to a single pair of responses per prompt, overlooking the potential benefits of utilizing multiple preference pairs.
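To make the pairwise setup concrete, here is a minimal sketch of the standard DPO loss for one (chosen, rejected) pair, written with plain Python math. The function name and argument names are illustrative; inputs are assumed to be summed log-probabilities of each full response under the policy and a frozen reference model.

```python
import math

def dpo_loss(policy_chosen_logp, policy_rejected_logp,
             ref_chosen_logp, ref_rejected_logp, beta=0.1):
    """Standard DPO loss for one (chosen, rejected) response pair.

    Inputs are summed log-probabilities of each response under the
    policy and the frozen reference model; beta scales the implicit
    reward. Returns -log(sigmoid(reward_margin)).
    """
    chosen_reward = beta * (policy_chosen_logp - ref_chosen_logp)
    rejected_reward = beta * (policy_rejected_logp - ref_rejected_logp)
    margin = chosen_reward - rejected_reward
    # Numerically this is -log(sigmoid(margin)) = log(1 + exp(-margin)).
    return math.log(1.0 + math.exp(-margin))
```

The loss shrinks as the policy assigns relatively more probability mass to the chosen response than the reference model does; in practice this runs over batches of pairs with a deep-learning framework, not scalar values.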
In this work, the authors propose Curry-DPO, which incorporates curriculum learning over multiple preference pairs into the DPO training framework. By ordering the preference pairs from easy to hard during training, the method achieves significant improvements over standard DPO settings. Experiments on benchmarks such as MT-Bench, WizardLM, and UltraFeedback demonstrate the superior performance of Curry-DPO compared to traditional DPO.
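The easy-to-hard ordering can be sketched as follows. This is an illustrative construction, not the paper's exact procedure: it assumes each prompt has several responses with scalar quality scores (e.g. from a reward model or annotators), pairs the top response against each of the others, and treats a larger score gap as an easier pair.

```python
def curriculum_pairs(responses):
    """Build (chosen, rejected) pairs for one prompt, ordered easy-to-hard.

    `responses` is a list of (text, score) tuples. The highest-scored
    response is paired against every other response; a larger score gap
    is treated as an easier pair, so pairs are sorted by descending gap
    (easy pairs first, hard pairs last).
    """
    ranked = sorted(responses, key=lambda r: r[1], reverse=True)
    chosen = ranked[0]
    pairs = [(chosen[0], rej[0], chosen[1] - rej[1]) for rej in ranked[1:]]
    pairs.sort(key=lambda p: p[2], reverse=True)  # largest gap first
    return [(c, r) for c, r, _ in pairs]
```

Training then consumes these pairs in order, so the model first learns from clear-cut comparisons before being exposed to near-ties.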
The study highlights the importance of iterative training within curriculum learning and shows how selecting the reference model from the previous iteration can lead to better alignment with human preferences. The authors also discuss ethical considerations around harmful content generation, emphasizing the need for caution when using advanced language models on sensitive topics.
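The iterative scheme described above can be sketched as a simple training loop. All names here are hypothetical: `train_one_pair_set` stands in for a DPO training run on one difficulty bucket against a given frozen reference, and the policy from each iteration becomes the reference for the next.

```python
def curry_dpo_train(model, pair_sets_easy_to_hard, train_one_pair_set):
    """Iterative curriculum training loop (illustrative sketch).

    Trains on one difficulty bucket per iteration; the policy produced
    by the previous iteration becomes the frozen reference for the next,
    as the summary describes. `train_one_pair_set(model, reference,
    pair_set)` is a hypothetical helper that runs DPO on one pair set.
    """
    reference = model.copy()                  # initial reference = starting policy
    for pair_set in pair_sets_easy_to_hard:   # easy buckets first, hard last
        model = train_one_pair_set(model, reference, pair_set)
        reference = model.copy()              # refresh reference each iteration
    return model
```

The design choice being illustrated is that the reference is not kept fixed at the original SFT checkpoint but is advanced alongside the curriculum, which the summary credits with better final alignment.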
Overall, Curry-DPO presents a promising approach to enhancing alignment between LLMs and human preferences through innovative curriculum learning techniques.
Statistics
Curry-DPO shows consistent performance gains on MT-Bench, Vicuna-Bench, WizardLM, and the UltraFeedback test set.
Curry-DPO achieves a score of 7.43 on MT-Bench with Zephyr-7B.
Curry-DPO achieves the highest win rates on the Vicuna-Bench (90.7%), WizardLM (87.1%), and UltraFeedback (87.9%) test sets.
Quotes
"There is no justification for self-harm or suicide." - Content Warning Statement