This paper develops a theoretical framework for aligning generative models with Reinforcement Learning from Human Feedback (RLHF), formulated as a reverse-KL regularized contextual bandit problem. It provides a comprehensive analysis in the offline, online, and hybrid settings, and proposes algorithms that use uncertainty estimation and non-symmetric exploration structures to handle the KL penalty and the preference-learning objective. The proposed methods significantly outperform existing baselines in real-world large language model experiments, tying the theoretical analysis directly to practical gains.
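For reference, the reverse-KL regularized objective behind this formulation is the standard one; writing $\pi_0$ for the reference policy, $r^\star$ for the ground-truth reward, and $\eta > 0$ for the KL coefficient (notation ours, not necessarily the paper's):

$$
\max_{\pi}\ \mathbb{E}_{x \sim d_0,\; y \sim \pi(\cdot \mid x)}\big[r^\star(x, y)\big] \;-\; \eta\, \mathbb{E}_{x \sim d_0}\Big[\mathrm{KL}\big(\pi(\cdot \mid x)\,\big\|\,\pi_0(\cdot \mid x)\big)\Big],
\qquad
\pi^\star(y \mid x) \;\propto\; \pi_0(y \mid x)\, \exp\!\big(r^\star(x, y)/\eta\big).
$$

The closed-form Gibbs solution on the right is what makes the KL penalty analytically tractable.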
MA-RLHF improves the efficiency and quality of aligning large language models with human preferences by incorporating macro actions, which are sequences of tokens, into the reinforcement learning process.
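A minimal sketch of the macro-action idea, assuming fixed-length token chunks as the macro actions; the chunk size `n` and the per-token log-probability inputs are illustrative assumptions, not the authors' implementation:

```python
# Group a sampled token sequence into fixed-length "macro actions" so that
# credit assignment in PPO-style updates happens per chunk rather than per token.

from typing import List

def to_macro_actions(token_ids: List[int], n: int = 5) -> List[List[int]]:
    """Split a token sequence into consecutive macro actions of up to n tokens."""
    return [token_ids[i:i + n] for i in range(0, len(token_ids), n)]

def macro_logprobs(token_logprobs: List[float], n: int = 5) -> List[float]:
    """A macro action's log-prob is the sum of its tokens' log-probs, so the
    PPO importance ratio is computed once per chunk instead of once per token."""
    return [sum(token_logprobs[i:i + n]) for i in range(0, len(token_logprobs), n)]

# Example: a 12-token response becomes 3 macro actions of sizes 5, 5, and 2.
print(to_macro_actions(list(range(12)), n=5))
```

Coarser action units shorten the effective decision horizon, which is the source of the claimed efficiency gains.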
REFUEL is a novel and efficient algorithm for training large language models on multi-turn tasks using RLHF, addressing the covariate shift problem inherent in single-turn methods by employing on-policy data and a regression-based approach to predict relative future rewards.
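A hedged sketch of the regression idea: from the same context, sample two on-policy rollouts, then regress the policy's scaled log-probability gap between them onto their observed reward gap. The tensor names, the coefficient `eta`, and the exact pairing scheme are assumptions for illustration, not REFUEL's precise parameterization:

```python
import torch

def relative_reward_regression_loss(
    logp_a: torch.Tensor,      # log-prob of rollout_a under the current policy
    logp_b: torch.Tensor,      # log-prob of rollout_b under the current policy
    logp_a_ref: torch.Tensor,  # same quantities under the sampling policy
    logp_b_ref: torch.Tensor,
    reward_a: torch.Tensor,    # observed future reward of rollout_a
    reward_b: torch.Tensor,    # observed future reward of rollout_b
    eta: float = 1.0,
) -> torch.Tensor:
    """Squared-error regression: the policy's implicit relative value of the
    two rollouts should match their observed reward difference."""
    implicit_gap = eta * ((logp_a - logp_a_ref) - (logp_b - logp_b_ref))
    return ((implicit_gap - (reward_a - reward_b)) ** 2).mean()
```

Because both rollouts come from the current policy, the regression targets are never evaluated off-distribution, which is how the covariate shift problem is avoided.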
Combining imperfect proxy rewards with potentially suboptimal human corrective actions in a reinforcement learning framework can lead to more efficient learning and better-aligned policies compared to using either signal alone.
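One plausible instantiation of this combination, sketched below under assumed interfaces (an optional per-step `human_action` and an intervention penalty `c`, both illustrative rather than the paper's algorithm): execute the human's correction when one is offered and shape the proxy reward accordingly, so both signals inform the policy:

```python
from typing import Optional, Tuple

def combine_step(
    policy_action: int,
    human_action: Optional[int],
    proxy_reward: float,
    c: float = 0.5,
) -> Tuple[int, float]:
    """Execute the human correction when one is given, penalizing the policy's
    proposed action; otherwise fall back on the proxy reward alone. Even a
    suboptimal correction carries information the proxy reward lacks."""
    if human_action is not None:
        return human_action, proxy_reward - c
    return policy_action, proxy_reward
```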
A more accurate reward model does not always yield better language model performance.
This paper presents the first globally convergent online RLHF algorithm under neural network parameterization, addressing the distribution shift issue and providing convergence guarantees with state-of-the-art sample complexity.