This research paper proposes Regularized Preference Optimization (RPO), a novel RLHF algorithm that mitigates overoptimization in aligning LLMs by combining a preference optimization loss with an imitation (SFT) loss, theoretically grounded in a maximin objective that minimizes the sum of the MLE loss and the expected reward value.
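The core idea is a composite training objective. Below is a minimal sketch of such a loss, assuming the preference term is DPO-style (log-sigmoid of reward margins against a reference model); the function name, tensor names, and the weighting coefficient `eta` are illustrative, not taken from the paper.

```python
import torch
import torch.nn.functional as F

def rpo_style_loss(policy_chosen_logps, policy_rejected_logps,
                   ref_chosen_logps, ref_rejected_logps,
                   beta=0.1, eta=1.0):
    """Preference-optimization loss plus an imitation (SFT) term on preferred responses."""
    # DPO-style preference loss on log-probability margins versus the reference model.
    chosen_margin = policy_chosen_logps - ref_chosen_logps
    rejected_margin = policy_rejected_logps - ref_rejected_logps
    pref_loss = -F.logsigmoid(beta * (chosen_margin - rejected_margin)).mean()

    # Imitation (SFT) loss: negative log-likelihood of the preferred responses,
    # which regularizes the policy and discourages reward overoptimization.
    sft_loss = -policy_chosen_logps.mean()

    # eta trades off the two terms (illustrative value, not from the paper).
    return pref_loss + eta * sft_loss
```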
Averaging the weights of multiple fine-tuned language models, a technique called "model soup," improves the effectiveness of Reinforcement Learning from Human Feedback (RLHF) by enabling broader exploration of the parameter space, yielding models that align better with human preferences.
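A minimal sketch of the weight-averaging step, assuming all checkpoints share the same architecture; function and variable names are illustrative.

```python
import torch

def average_state_dicts(state_dicts):
    """Uniformly average the parameters of several fine-tuned checkpoints ("model soup")."""
    averaged = {}
    for key in state_dicts[0]:
        # Stack the corresponding tensors from each checkpoint and take the elementwise mean.
        averaged[key] = torch.stack([sd[key].float() for sd in state_dicts]).mean(dim=0)
    return averaged

# Usage (illustrative): average several checkpoints and load the soup into one model.
# soup = average_state_dicts([torch.load(path) for path in checkpoint_paths])
# model.load_state_dict(soup)
```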
This research paper presents the first globally convergent online RLHF algorithm with neural network parameterization, addressing the distribution shift issue and providing theoretical convergence guarantees with state-of-the-art sample complexity.
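To make the distribution-shift point concrete, here is a minimal sketch of a generic online RLHF loop, not the paper's specific algorithm: preferences are collected from the current policy at each round, so the training data stays on-distribution. All object and function names (`policy.sample_pair`, `preference_oracle`, `update_policy`) are illustrative assumptions.

```python
def online_rlhf_loop(policy, preference_oracle, update_policy,
                     prompts, num_iters, batch_size):
    """Generic online RLHF loop: gather fresh preferences from the current policy each round."""
    for _ in range(num_iters):
        prompt_batch = prompts[:batch_size]
        # Sampling response pairs from the *current* policy keeps preference data
        # on-distribution, which is how online RLHF addresses distribution shift.
        pairs = [policy.sample_pair(p) for p in prompt_batch]
        labels = [preference_oracle(p, a, b) for p, (a, b) in zip(prompt_batch, pairs)]
        # Update the (neural-network-parameterized) policy on the fresh preference data.
        policy = update_policy(policy, prompt_batch, pairs, labels)
    return policy
```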
Higher reward model accuracy does not always translate into better language model performance.