Key Concepts
This paper presents a theoretical framework for aligning generative models with Reinforcement Learning from Human Feedback (RLHF), formulated as a reverse-KL regularized contextual bandit problem. It provides a comprehensive theoretical analysis in the offline, online, and hybrid settings, and proposes new algorithms that incorporate uncertainty estimation and non-symmetric exploration structures to handle the challenges posed by the KL penalty and preference-based learning. The proposed methods significantly outperform existing baselines in real-world large language model experiments, demonstrating how solid theoretical foundations translate into strong practical performance.
Abstract
The paper studies the theoretical foundations of aligning generative models with Reinforcement Learning from Human Feedback (RLHF). It adopts a standard mathematical formulation, the reverse-KL regularized contextual bandit, and investigates its behavior in three distinct settings: offline, online, and hybrid.
Key highlights:
- Formal formulation of RLHF as a reverse-KL regularized contextual bandit problem, which aligns more closely with real-world alignment practices compared to existing theoretical frameworks.
- Comprehensive theoretical analysis in offline, online, and hybrid settings, providing finite-sample guarantees.
- Introduction of new algorithms that incorporate uncertainty estimation and non-symmetric exploration structures to handle the KL penalty and preference learning challenges.
- Empirical evaluations on real-world large language models demonstrating significant performance improvements over existing baselines such as DPO and RSO.
- Insights on the advantages of reward modeling, where the sample complexity depends on the complexity of the reward model rather than the generative model.
The paper bridges the gap between solid theoretical foundations and powerful practical implementations of RLHF, providing a principled understanding of the alignment process and motivating future algorithmic designs.
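To make the reverse-KL regularized bandit formulation concrete, below is a minimal sketch (not from the paper; the reward values, reference policy, and regularization coefficient `eta` are illustrative assumptions) of the objective E_π[r(x, a)] - η·KL(π(·|x) || π0(·|x)) and its well-known closed-form Gibbs-style maximizer over a discrete response set.

```python
import numpy as np

def gibbs_policy(reward: np.ndarray, pi0: np.ndarray, eta: float) -> np.ndarray:
    """Closed-form maximizer of E_pi[r] - eta * KL(pi || pi0) over a discrete response set.

    The optimal policy satisfies pi*(a|x) proportional to pi0(a|x) * exp(r(x, a) / eta),
    i.e. it trades higher reward against staying close to the reference policy pi0.
    """
    logits = np.log(pi0) + reward / eta
    logits -= logits.max()              # subtract max for numerical stability
    weights = np.exp(logits)
    return weights / weights.sum()

# Illustrative numbers: 3 candidate responses for a single prompt.
pi0 = np.array([0.5, 0.3, 0.2])         # reference policy (e.g. the starting checkpoint)
reward = np.array([1.0, 2.0, 0.5])      # reward model scores r(x, a)

for eta in (0.1, 1.0, 10.0):
    print(eta, gibbs_policy(reward, pi0, eta))
# Small eta -> nearly greedy on the reward; large eta -> stays close to pi0.
```

The KL coefficient η is what keeps the aligned policy from drifting too far from the original model, matching the constraint described in the quotes below.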
Statistics
The reward function is parameterized as r(x, a) = ⟨θ, φ(x, a)⟩, where x is the prompt, a is the response, and φ is the feature extractor.
The ground-truth reward function is r*(x, a) = ⟨θ*, φ(x, a)⟩, where θ* is the true parameter.
The preference probability follows the Bradley-Terry model: P(a1 ≻ a2 | x, a1, a2) = σ(r*(x, a1) - r*(x, a2)), where σ is the sigmoid function.
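As a quick illustration of these definitions, here is a small sketch (with toy numbers; the feature vectors and θ* are assumptions, not values from the paper) that evaluates the linear reward r(x, a) = ⟨θ, φ(x, a)⟩ and the Bradley-Terry preference probability for a pair of responses.

```python
import numpy as np

def sigmoid(z: float) -> float:
    return 1.0 / (1.0 + np.exp(-z))

def linear_reward(theta: np.ndarray, phi_xa: np.ndarray) -> float:
    """Linear reward r(x, a) = <theta, phi(x, a)> for one prompt-response feature vector."""
    return float(theta @ phi_xa)

def bt_preference_prob(theta: np.ndarray, phi_x_a1: np.ndarray, phi_x_a2: np.ndarray) -> float:
    """Bradley-Terry model: P(a1 preferred over a2 | x) = sigmoid(r(x, a1) - r(x, a2))."""
    return sigmoid(linear_reward(theta, phi_x_a1) - linear_reward(theta, phi_x_a2))

# Toy example with a 3-dimensional feature extractor phi and an assumed true parameter theta*.
theta_star = np.array([0.7, -0.2, 1.1])
phi_a1 = np.array([1.0, 0.0, 0.5])       # phi(x, a1)
phi_a2 = np.array([0.2, 1.0, 0.3])       # phi(x, a2)

p = bt_preference_prob(theta_star, phi_a1, phi_a2)
print(f"P(a1 preferred over a2) = {p:.3f}")
# A binary preference label for training data can then be sampled as y ~ Bernoulli(p).
```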
Quotes
"Despite its effectiveness, RLHF's implementation often involves ad-hoc practices and extensive algorithmic tuning in the entire pipeline, including preference data collection, preference/reward modeling, and model optimization."
"A deterministic maximizer of the reward tends to compromise on these aspects significantly. For example, the maximizer of the 'safety reward' tends to avoid providing answers all the time, which contradicts the LLM's training objective."
"The KL regularized contextual bandit additionally imposes a constraint that the optimal policy cannot move too far away from the original policy (i.e. the starting checkpoint of the LLM)."