Key Idea
Contrastive rewards enhance RLHF by making policy optimization robust to noisy reward models and improving overall performance.
Abstract
The paper discusses the challenge of aligning large language models (LLMs) with reinforcement learning from human feedback (RLHF) when the underlying reward model is noisy. The proposed method introduces contrastive rewards, which improve the effectiveness of RLHF by penalizing reward uncertainty, enhancing robustness, and reducing variance in Proximal Policy Optimization (PPO). Extensive experiments show substantial improvements over strong baselines in both GPT and human evaluations.
Statistics
Our approach involves two steps: an offline sampling step that obtains baseline responses to the training prompts, and a PPO step that uses a contrastive reward computed against those baseline responses.
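A minimal sketch of what this two-step reward shaping could look like; the `reward_model` callable, its signature, and the plain mean-subtraction penalty are illustrative assumptions, not necessarily the paper's exact formulation:

```python
import numpy as np

def contrastive_reward(reward_model, prompt, response, baseline_responses):
    """Score a policy response against offline baseline responses.

    reward_model(prompt, response) -> float is a hypothetical scoring
    interface standing in for the trained reward model.
    """
    # Step 1 (offline, done once per prompt): score the baseline responses,
    # e.g., samples from the SFT model; these scores can be cached.
    baseline_scores = [reward_model(prompt, y) for y in baseline_responses]

    # Step 2 (during PPO): subtract the baseline mean from the raw reward.
    # On prompts where the reward model is noisy, the baseline scores absorb
    # part of that noise, so the difference penalizes reward uncertainty,
    # reduces variance, and credits only genuine improvement over baselines.
    return reward_model(prompt, response) - float(np.mean(baseline_scores))
```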
Empirically, contrastive rewards improve RLHF substantially, as evaluated by both GPT models and human annotators.
The proposed method consistently outperforms strong baselines across various tasks evaluated by human annotators.
The Mistral-7B-CR model surpasses other models on the MT-Bench benchmark by a significant margin.
Mistral-7B-CR demonstrates the lowest Attack Success Rate (ASR) across all red-teaming prompt templates on the RED-EVAL benchmark.
Quotes
"Contrastive rewards enable the LLM to penalize reward uncertainty, improve robustness, encourage improvement over baselines, calibrate according to task difficulty, and reduce variance in PPO."
"Our method consistently outperforms strong baselines across various tasks evaluated by human annotators."
"The recent work on direct preference optimization is one of such efforts among others."