Improving Code Generation Performance of Large Language Models through Policy Filtration in Reinforcement Learning from Human Feedback
Reinforcement learning from human feedback (RLHF) can help large language models (LLMs) generate helpful and harmless responses, but its effectiveness is limited by the inaccuracy of the intermediate reward model that guides policy learning. This paper proposes Policy Filtration for Proximal Policy Optimization (PF-PPO), which improves the signal-to-noise ratio of the reward signal during policy learning by filtering out samples whose rewards are likely unreliable.
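As a rough illustration of the filtering idea described above, the sketch below keeps only sampled responses whose reward-model scores fall in the extreme quantiles of a rollout batch, under the assumption that the reward model is more reliable at the extremes. The function name `filter_rollouts` and the quantile thresholds are illustrative assumptions, not the paper's implementation.

```python
import numpy as np

def filter_rollouts(responses, rewards, keep_top=0.3, keep_bottom=0.0):
    """Keep only responses whose reward-model score falls in the extreme
    quantiles of the batch, where the reward signal is assumed to be more
    reliable; the remaining samples are dropped before the PPO update.

    `keep_top` / `keep_bottom` are illustrative hyperparameters, not values
    taken from the paper.
    """
    rewards = np.asarray(rewards, dtype=float)
    order = np.argsort(rewards)               # indices sorted by ascending reward
    n = len(rewards)
    n_top = int(np.ceil(keep_top * n))        # highest-reward samples to keep
    n_bottom = int(np.ceil(keep_bottom * n))  # lowest-reward samples to keep
    keep_idx = set(order[n - n_top:]) | set(order[:n_bottom])
    return [(responses[i], rewards[i]) for i in sorted(keep_idx)]

# Toy usage: six sampled responses for one prompt and their reward-model scores.
responses = [f"response_{i}" for i in range(6)]
rewards = [0.1, 0.9, 0.4, 0.8, 0.2, 0.5]
for resp, r in filter_rollouts(responses, rewards, keep_top=0.33):
    print(resp, r)
```

In this sketch, only the filtered subset would be passed to the subsequent policy-gradient update; how many samples to retain, and whether to also keep low-reward samples, is a design choice the paper studies empirically.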