The paper proposes a new framework called REBEL (Reward rEgularization Based robotic rEinforcement Learning from human feedback) to address the challenge of reward overoptimization in preference-based robotic reinforcement learning.
The key contributions are:
Introducing a new regularization term called "agent preference," which incorporates the value function of the current policy into the reward-learning process. This aligns the learned reward function with the agent's own preferences in addition to human preferences (see the loss sketch after this list).
Providing a theoretical justification for the proposed regularization by connecting it to a bilevel optimization formulation of the preference-based reinforcement learning problem (a schematic version of this formulation follows the list).
Demonstrating the effectiveness of the REBEL approach on several continuous control benchmarks including DeepMind Control Suite and MetaWorld. REBEL achieves up to 70% improvement in sample efficiency compared to state-of-the-art baselines like PEBBLE and PEBBLE+SURF.
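To make the agent-preference idea concrete, here is a minimal PyTorch sketch of a regularized reward-learning objective: a standard Bradley-Terry loss on human preference pairs plus a value-alignment penalty. All names (`reward_net`, `value_fn`, `agent_preference_loss`, the weight `lam`) and the MSE form of the regularizer are illustrative assumptions, not the paper's exact formulation.

```python
import torch
import torch.nn.functional as F

def human_preference_loss(reward_net, seg_a, seg_b, labels):
    # Bradley-Terry loss on human preference pairs.
    # seg_a, seg_b: (batch, T, obs_act_dim) trajectory segments;
    # labels: (batch,) long, 0 if seg_a is preferred, 1 if seg_b is.
    # Assumes reward_net maps (..., obs_act_dim) -> (..., 1).
    ret_a = reward_net(seg_a).squeeze(-1).sum(dim=1)  # segment return under learned reward
    ret_b = reward_net(seg_b).squeeze(-1).sum(dim=1)
    logits = torch.stack([ret_a, ret_b], dim=1)       # (batch, 2)
    return F.cross_entropy(logits, labels)

def agent_preference_loss(reward_net, value_fn, obs_act, obs):
    # Assumed "agent preference" regularizer: pull the learned reward toward
    # the current policy's value estimates, keeping reward learning
    # consistent with what the agent itself prefers.
    r_hat = reward_net(obs_act).squeeze(-1)           # (batch,)
    with torch.no_grad():
        v = value_fn(obs).squeeze(-1)                 # value of the current policy
    return F.mse_loss(r_hat, v)

def rebel_reward_loss(reward_net, value_fn, pref_batch, rollout_batch, lam=0.1):
    # Total objective: human preference loss + lam * agent preference term.
    seg_a, seg_b, labels = pref_batch
    obs_act, obs = rollout_batch
    return (human_preference_loss(reward_net, seg_a, seg_b, labels)
            + lam * agent_preference_loss(reward_net, value_fn, obs_act, obs))
```

Setting `lam=0` recovers a plain preference-based reward loss in the style of PEBBLE; the added term couples reward learning to the agent's current value estimates.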
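The bilevel connection can be written schematically as below; the notation is illustrative rather than the paper's exact statement. The upper level fits the reward parameters to human preferences, while the lower level optimizes the policy under that learned reward; relaxing the coupling between the two levels is what motivates a value-based (agent-preference) term in the reward objective.

```latex
% Schematic bilevel formulation of preference-based RL (illustrative notation)
\begin{aligned}
\min_{\theta}\quad & \mathcal{L}_{\mathrm{pref}}\!\left(r_\theta;\ \mathcal{D}_{\mathrm{human}}\right)
  && \text{(upper level: reward learning)}\\
\text{s.t.}\quad & \pi^{*}(\theta) \in \arg\max_{\pi}\;
  \mathbb{E}_{\pi}\!\left[\sum_{t\ge 0} \gamma^{t}\, r_\theta(s_t, a_t)\right]
  && \text{(lower level: policy optimization)}
\end{aligned}
```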
The paper highlights that the proposed agent-preference regularization is crucial for mitigating reward overoptimization, which has been a key limitation of prior preference-based reinforcement learning methods. The theoretical analysis and empirical results showcase the benefits of the REBEL framework in aligning the learned reward function with the human's true behavioral intentions.
Key insights obtained from: Souradip Cha..., arxiv.org, 04-16-2024, https://arxiv.org/pdf/2312.14436.pdf