Stepwise Direct Preference Optimization: Unlocking Improved Alignment and Performance for Large Language Models
Stepwise Direct Preference Optimization (sDPO) is an extension of Direct Preference Optimization (DPO) that partitions the available preference data and uses it step by step: each step runs DPO on one partition, with the aligned model from the previous step serving as the reference model for the next. This stepwise curriculum yields more performant and better-aligned large language models than applying DPO to all of the data at once.
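The stepwise procedure can be sketched as a short training loop. This is a minimal illustration, not the paper's implementation: `dpo_train` is a hypothetical stand-in for a single DPO training run (for example, one invocation of a DPO trainer), and the models are represented abstractly.

```python
def partition(dataset, num_steps):
    """Split the preference dataset into `num_steps` contiguous chunks."""
    chunk = (len(dataset) + num_steps - 1) // num_steps  # ceiling division
    return [dataset[i:i + chunk] for i in range(0, len(dataset), chunk)]

def sdpo(policy_model, preference_data, num_steps, dpo_train):
    """Run DPO step by step over partitions of the preference data.

    Each step trains the policy on one chunk; the aligned model from that
    step then becomes the frozen reference model for the next step.
    `dpo_train(policy, reference, chunk)` is an assumed helper that performs
    one DPO run and returns the updated policy.
    """
    reference_model = policy_model  # step 1: initial model doubles as reference
    for chunk in partition(preference_data, num_steps):
        policy_model = dpo_train(policy_model, reference_model, chunk)
        reference_model = policy_model  # aligned model anchors the next step
    return policy_model
```

The design choice this illustrates: vanilla DPO keeps a single fixed reference model throughout training, whereas sDPO moves the reference forward after each step, so later steps are regularized toward an already better-aligned model.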