Core Concepts
Stepwise Direct Preference Optimization (sDPO) is an extension of Direct Preference Optimization (DPO) that uses preference datasets step by step rather than all at once, yielding large language models that are both better aligned and more performant.
Abstract
The paper proposes a novel approach called Stepwise Direct Preference Optimization (sDPO) to improve the alignment and performance of large language models (LLMs).
Key highlights:
- Conventional DPO uses all available preference datasets at once, which can be suboptimal because the reference model (typically the SFT base model) is not yet aligned.
- sDPO divides the preference datasets into multiple steps and uses the aligned model from the previous step as the reference model for the current step (see the training-loop sketch after this list).
- This results in a more aligned reference model, leading to better optimization of the target model and improved overall performance.
- Experiments show that sDPO outperforms DPO and other popular LLMs in terms of the H4 metric, the average score across four benchmark tasks (ARC, HellaSwag, MMLU, and TruthfulQA).
- sDPO also demonstrates significant improvements on the TruthfulQA task, highlighting its effectiveness in alignment tuning.
- Ablation studies confirm the importance of using a more aligned reference model and the benefits of initializing the target model with the previous step's aligned model.
- The authors discuss limitations of the study, such as the need for further exploration of dataset segmentation strategies and evaluation on a broader range of LLMs.
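As a rough illustration of the stepwise idea described above, here is a minimal sketch (not the authors' code) of the sDPO training schedule. It assumes a hypothetical `train_dpo(model, ref_model, dataset)` helper that stands in for any standard DPO training routine:

```python
import copy

def sdpo(sft_model, dataset_chunks, train_dpo):
    """Minimal sketch of the sDPO schedule (hypothetical helper names).

    dataset_chunks: the preference data split into steps D1, D2, ...
    train_dpo: any routine that runs DPO on (model, ref_model, dataset)
               and returns the aligned model.
    """
    target = sft_model  # step 1 starts from the SFT base model
    for chunk in dataset_chunks:
        # The model aligned in the previous step serves as the (frozen)
        # reference model for the current step; a copy avoids aliasing.
        reference = copy.deepcopy(target)
        # The target model is initialized from the previous step's aligned
        # model and trained only on the current chunk of preference data.
        target = train_dpo(model=target, ref_model=reference, dataset=chunk)
    return target
```

At step 1 the reference is simply the SFT base model, as in conventional DPO; from step 2 onward it is progressively better aligned, which is what the paper credits for the improved optimization of the target model.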
Stats
The mean γπref (the reference model's log ratio of the chosen over the rejected response) increases from -38.60 for the SFT base model to -25.10 for the aligned model from the first step of sDPO, a substantial improvement of 13.50 in log scale.
Using the aligned model from the second step of sDPO as the reference model results in a staggeringly high mean γπref of 84.35, indicating potential overfitting to the preference dataset.
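For context, γπref is the reference model's sequence-level log-probability gap between the chosen and the rejected response. A minimal PyTorch sketch of this statistic (assuming per-example sequence log-probabilities have already been computed) might look like:

```python
import torch

def mean_gamma_ref(ref_logp_chosen: torch.Tensor,
                   ref_logp_rejected: torch.Tensor) -> float:
    """Mean of gamma_piref(x, y_w, y_l) = log pi_ref(y_w|x) - log pi_ref(y_l|x)
    over a preference dataset, given per-example sequence log-probabilities."""
    return (ref_logp_chosen - ref_logp_rejected).mean().item()
```

A better-aligned reference model assigns relatively higher probability to the chosen responses, pushing this mean upward, which is the trend the numbers above reflect.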
Quotes
"Using Intel-7B-DPO as the reference model results in the best performance, even better than using SOLAR-0-70B, which is a much larger model that was trained with more data. Thus, whether the reference model is pre-aligned or not plays an important role in the resulting aligned model's performance."
"To gain a deeper understanding of sDPO, we rearrange the DPO loss from (Rafailov et al., 2023), as follows: LDPO(πθ, πref) = -E(x,yw,yl)∼D [log σ(β · (γπθ(x, yw, yl) - γπref(x, yw, yl))]."