The paper addresses the challenge of inaccurate reward models in reinforcement learning from human feedback (RLHF) for training large language models (LLMs) to generate helpful and harmless responses. The authors observe that the reward model is more reliable in some regions than others: for example, its judgments are more trustworthy for responses it assigns high rewards than for those it assigns moderate rewards.
To address this, the authors propose Policy Filtration for Proximal Policy Optimization (PF-PPO), which modifies the standard PPO-based RLHF algorithm. PF-PPO generates multiple responses for each prompt, scores them with the reward model, and then uses only a filtered subset of these samples for policy training. The filtration schemes are designed to improve the reliability of the reward model on the retained samples by maximizing the coefficient of determination (R²) between the rewards and the actual scores of those samples.
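To make the procedure concrete, here is a minimal Python sketch of how the best-random (BR) and best-worst (BW) filtration schemes described below could look, together with an R² statistic for comparing schemes. The function names (`filter_responses`, `r_squared`), the `reward_fn` callable, and the parameter `k` are illustrative assumptions rather than the authors' implementation, and R² is computed here as the squared Pearson correlation (the R² of a simple linear fit of verified scores on rewards); the paper's exact estimator may differ.

```python
import random
from typing import Callable, List, Tuple

import numpy as np


def filter_responses(
    responses: List[str],
    reward_fn: Callable[[str], float],  # reward model scoring one response (assumed interface)
    scheme: str = "BR",                 # "BR" = best-random, "BW" = best-worst
    k: int = 2,                         # number of responses kept per prompt (BR only)
) -> List[Tuple[str, float]]:
    """Score all sampled responses for one prompt and keep a filtered subset."""
    scored = [(r, reward_fn(r)) for r in responses]
    scored.sort(key=lambda x: x[1], reverse=True)  # highest reward first

    if scheme == "BR":
        # Best-random: keep the top-scored response plus (k - 1) random others.
        kept = [scored[0]] + random.sample(scored[1:], k - 1)
    elif scheme == "BW":
        # Best-worst: keep the highest- and lowest-scored responses.
        kept = [scored[0], scored[-1]]
    else:
        raise ValueError(f"unknown filtration scheme: {scheme}")
    return kept


def r_squared(rewards, true_scores) -> float:
    """R² between model rewards and ground-truth scores on the filtered samples."""
    rewards = np.asarray(rewards, dtype=float)
    true_scores = np.asarray(true_scores, dtype=float)
    return float(np.corrcoef(rewards, true_scores)[0, 1] ** 2)


# Example: five sampled responses for one prompt, scored by a mock reward model.
responses = [f"candidate_{i}" for i in range(5)]
mock_rewards = {r: s for r, s in zip(responses, [0.9, 0.2, 0.7, 0.1, 0.5])}
kept = filter_responses(responses, lambda r: mock_rewards[r], scheme="BW")
```

In the full algorithm, only the kept (response, reward) pairs would be passed to the PPO update; the unfiltered samples are discarded for policy training.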
The authors conduct extensive experiments on code generation tasks, which are challenging due to the long-chain logic required. They compare PF-PPO with various baselines, including supervised fine-tuning methods, direct policy optimization methods, and standard RL-based methods. The results show that PF-PPO, especially the variants using best-random (BR) and best-worst (BW) filtering, significantly outperforms the baselines on HumanEval, MBPP, and a new LeetCode Contest benchmark. The authors also provide a detailed analysis of the computational efficiency and training process of PF-PPO, PPO with multiple responses (PPO-M), and standard PPO (PPO-S).
Source: Wei Shen et al., arXiv, 09-12-2024. https://arxiv.org/pdf/2409.06957.pdf