Core Concepts
Curry-DPO improves LLM alignment using multiple preference pairs and curriculum learning.
Abstract
Direct Preference Optimization (DPO) leverages pairwise preference data to align LLMs to human preferences.
Curry-DPO systematically curates multiple preference pairs and uses curriculum learning for alignment.
The method consistently outperforms standard DPO on various benchmarks.
Multiple preference pairs are ranked from easy to hard during training, improving performance.
Experiments show consistent gains over standard DPO on the MT-Bench, Vicuna Bench, WizardLM, and UltraFeedback test sets.
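As a rough illustration of the idea above, the snippet below builds all chosen/rejected pairs from a set of scored responses for one prompt and orders them from easy to hard. The quality scores, helper names, and the score-gap difficulty heuristic are assumptions for illustration, not the paper's exact criterion.

```python
def build_preference_pairs(responses):
    """Given (text, quality_score) tuples for one prompt, form every
    chosen/rejected pair. A larger score gap is treated as an 'easier'
    pair (illustrative heuristic, not necessarily the paper's)."""
    ranked = sorted(responses, key=lambda r: r[1], reverse=True)
    pairs = []
    for i in range(len(ranked)):
        for j in range(i + 1, len(ranked)):
            chosen, rejected = ranked[i], ranked[j]
            pairs.append({
                "chosen": chosen[0],
                "rejected": rejected[0],
                "gap": chosen[1] - rejected[1],  # proxy for pair difficulty
            })
    return pairs

def curriculum_order(pairs):
    # Easy-to-hard curriculum: train on large-gap (easy) pairs first.
    return sorted(pairs, key=lambda p: p["gap"], reverse=True)
```

With three responses per prompt this yields three preference pairs instead of the single pair standard DPO would use, and the curriculum presents the clearest-cut pair first.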
Stats
Proposes creating multiple preference pairs for prompts that have multiple responses.
Curry-DPO achieves a score of 7.43 on MT-Bench, outperforming other LLMs.