Core Concepts
Token-level Direct Preference Optimization (TDPO) is a novel token-level optimization approach that aligns Large Language Models with human preferences while preserving generation diversity.
Abstract
The paper introduces Token-level Direct Preference Optimization (TDPO), a novel approach to aligning Large Language Models (LLMs) with human preferences by optimizing the policy at the token level. Unlike previous methods such as Direct Preference Optimization (DPO), which evaluate entire responses, TDPO measures divergence from a reference LLM at a more granular, token-by-token level.
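One way to make the contrast concrete (the notation below is adapted for this summary and may not match the paper exactly): DPO works with a sequence-level margin between the preferred response y_w and the dispreferred response y_l, while the token-level view accumulates a forward KL term against the reference model at every generation step.

```latex
% Sequence-level margin used by DPO-style objectives (implicit reward difference)
\begin{align*}
u(x, y_w, y_l) &= \beta \log \frac{\pi_\theta(y_w \mid x)}{\pi_{\mathrm{ref}}(y_w \mid x)}
                - \beta \log \frac{\pi_\theta(y_l \mid x)}{\pi_{\mathrm{ref}}(y_l \mid x)} \\[4pt]
% Token-level view: forward KL against the reference model, accumulated over
% the T tokens of a response y = (y^1, \dots, y^T)
D_{\mathrm{SeqKL}}(x, y; \pi_{\mathrm{ref}} \,\|\, \pi_\theta)
  &= \sum_{t=1}^{T} D_{\mathrm{KL}}\!\left(\pi_{\mathrm{ref}}(\cdot \mid x, y^{<t}) \,\middle\|\, \pi_\theta(\cdot \mid x, y^{<t})\right)
\end{align*}
```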
The key highlights are:
- TDPO incorporates a forward KL divergence constraint for each token, improving alignment and diversity relative to DPO, which regulates divergence only at the level of whole responses and faces challenges in divergence efficiency (a loss sketch follows this list).
- TDPO utilizes the Bradley-Terry model for a token-based reward system, enhancing the regulation of KL divergence while preserving simplicity without the need for explicit reward modeling.
- Experimental results across various text tasks demonstrate TDPO's superior performance in balancing alignment with generation diversity compared to DPO and PPO-based RLHF methods.
- TDPO fine-tuning strikes a better balance between alignment and generation diversity than DPO on controlled sentiment generation and single-turn dialogue datasets, and significantly improves the quality of generated responses.
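To make the highlighted ingredients concrete, here is a minimal PyTorch sketch of a TDPO-style objective, reconstructed from this summary rather than taken from the authors' released code; the function name, the default beta and alpha values, and the stop-gradient on the preferred-response KL term are assumptions about the TDPO2 variant, not a verbatim reproduction of the paper's implementation.

```python
# Minimal sketch of a TDPO-style objective (assumptions noted in the text above;
# not the authors' implementation).
import torch
import torch.nn.functional as F

def tdpo_style_loss(chosen_logratio, rejected_logratio,
                    chosen_seq_kl, rejected_seq_kl,
                    beta=0.1, alpha=0.5, tdpo2=True):
    """Bradley-Terry log-sigmoid loss with a token-level (sequential) KL margin.

    chosen_logratio / rejected_logratio: log pi_theta(y|x) - log pi_ref(y|x),
        summed over the tokens of the preferred / dispreferred response.
    chosen_seq_kl / rejected_seq_kl: forward KL(pi_ref || pi_theta) summed over
        the tokens of the corresponding response (see the sketch under Stats).
    """
    # Sequence-level implicit-reward margin, as in DPO.
    margin = chosen_logratio - rejected_logratio
    if tdpo2:
        # Assumed TDPO2-style variant: scale the KL margin by alpha and stop the
        # gradient through the preferred-response term, so the optimizer is not
        # rewarded for inflating divergence on preferred responses.
        kl_margin = alpha * (rejected_seq_kl - chosen_seq_kl.detach())
    else:
        # Assumed TDPO1-style variant: raw difference of the sequential KL terms.
        kl_margin = rejected_seq_kl - chosen_seq_kl
    # Bradley-Terry preference likelihood, maximized via the log-sigmoid.
    return -F.logsigmoid(beta * (margin - kl_margin)).mean()

# Example call with dummy per-response quantities for a batch of two pairs.
loss = tdpo_style_loss(torch.tensor([0.8, 0.3]), torch.tensor([-0.2, 0.1]),
                       torch.tensor([1.5, 2.0]), torch.tensor([2.5, 3.0]))
print(loss)
```

If this reading is right, the stop-gradient is what lets a TDPO2-style objective regulate KL more tightly than TDPO1: the optimizer cannot enlarge the preference margin simply by inflating divergence on the preferred responses.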
Stats
- The sequential KL divergence of DPO grows significantly faster on the dispreferred response subset than on the preferred response subset (a measurement sketch follows this list).
- TDPO2 exhibits superior regulation of KL divergence compared to the TDPO1 and DPO algorithms.
- On the Anthropic HH dataset, TDPO2 achieves higher accuracy in aligning with human preferences and higher entropy, indicating greater generation diversity, than DPO, f-DPO, and TDPO1.
- On the MT-Bench evaluation, TDPO2 achieves a higher win rate than DPO, TDPO1, and PPO, indicating that it generates higher-quality responses.
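As an illustration of the quantity behind the first stat, here is a self-contained sketch of sequential forward KL divergence; the function, tensor shapes, and toy inputs are assumptions for illustration, not the paper's evaluation code. Averaging this value separately over preferred and dispreferred responses at successive training checkpoints would give the two growth curves being compared.

```python
# Illustrative sketch (not the paper's evaluation code) of sequential forward
# KL divergence: KL(pi_ref || pi_theta) accumulated token by token over a response.
import torch
import torch.nn.functional as F

def sequential_forward_kl(policy_logits, ref_logits, response_mask):
    """Sum over response positions of KL(pi_ref(.|x, y<t) || pi_theta(.|x, y<t)).

    policy_logits, ref_logits: (batch, seq_len, vocab_size) logits over tokens.
    response_mask: (batch, seq_len), 1.0 on response tokens and 0.0 elsewhere.
    Returns a (batch,) tensor with one sequential-KL value per response.
    """
    ref_logp = F.log_softmax(ref_logits, dim=-1)
    policy_logp = F.log_softmax(policy_logits, dim=-1)
    # Forward KL takes the expectation under the reference distribution.
    per_token_kl = (ref_logp.exp() * (ref_logp - policy_logp)).sum(dim=-1)
    return (per_token_kl * response_mask).sum(dim=-1)

# Toy check with random logits: batch of 2 responses, length 4, vocab of 32.
policy_logits, ref_logits = torch.randn(2, 4, 32), torch.randn(2, 4, 32)
mask = torch.ones(2, 4)
print(sequential_forward_kl(policy_logits, ref_logits, mask))
```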
Quotes
"TDPO maintains the simplicity of DPO while offering improved regulation of KL divergence for aligning LLMs with human preferences."
"Experimental results across various text tasks demonstrate TDPO's superior performance in balancing alignment with generation diversity compared to DPO and PPO-based RLHF methods."