Jiang, Z., Feng, X., Zhu, Y., Weng, P., Song, Y., Zhou, T., ... & Fan, C. (2024). Reinforcement learning from imperfect corrective actions and proxy rewards. arXiv preprint arXiv:2410.05782.
This paper investigates the challenge of training reinforcement learning (RL) agents when a perfect reward function is unavailable. The authors aim to address this by leveraging both imperfect proxy rewards and potentially suboptimal human corrective actions to guide the learning process.
The authors propose ICoPro, an iterative value-based RL algorithm that alternates between three phases: (1) Data Collection: the agent interacts with the environment to gather transition data and receives corrective actions from a human labeler on sampled segments. (2) Fine-Tuning: the agent updates its Q-function with a margin loss so that the provided corrective actions are preferred over its own action choices. (3) Propagation: the agent is trained to maximize expected cumulative proxy rewards while enforcing consistency with both observed and pseudo-labeled corrective actions; a sketch of the two losses follows.
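To make the fine-tuning and propagation phases concrete, here is a minimal PyTorch-style sketch of the two losses, assuming a discrete-action Q-network. The margin loss follows the standard large-margin form used in imitation-augmented Q-learning; the names (`q_net`, `target_net`, `MARGIN`, `GAMMA`) and the batch layout are illustrative assumptions, not the paper's actual implementation, and the pseudo-labeling mechanism of the propagation phase is omitted for brevity.

```python
import torch
import torch.nn.functional as F

MARGIN = 0.8   # assumed margin width; a hyperparameter, not a value from the paper
GAMMA = 0.99   # assumed discount factor for the proxy-reward TD target

def margin_loss(q_values, corrective_actions):
    """Large-margin loss: push Q(s, a_human) above Q(s, a) + margin
    for every non-corrective action a.

    q_values:           (batch, n_actions) Q-values from the current network
    corrective_actions: (batch,) human-provided, possibly suboptimal actions
    """
    batch = torch.arange(q_values.size(0))
    margins = torch.full_like(q_values, MARGIN)
    margins[batch, corrective_actions] = 0.0  # no margin on the labeled action
    best = (q_values + margins).max(dim=1).values
    return (best - q_values[batch, corrective_actions]).mean()

def propagation_loss(q_net, target_net, s, a, proxy_r, s_next, corrective_a):
    """TD loss on proxy rewards plus margin consistency on the
    corrective labels (pseudo-labeled states would reuse margin_loss)."""
    q = q_net(s)
    with torch.no_grad():
        td_target = proxy_r + GAMMA * target_net(s_next).max(dim=1).values
    td_loss = F.smooth_l1_loss(q.gather(1, a.unsqueeze(1)).squeeze(1), td_target)
    return td_loss + margin_loss(q, corrective_a)
```

The key design choice this sketch illustrates is that the same margin term serves double duty: during fine-tuning it imitates the human's corrections directly, and during propagation it regularizes the proxy-reward TD objective so the learned Q-function stays consistent with those corrections.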
The study highlights the effectiveness of combining imperfect proxy rewards and human corrective actions for training RL agents. The proposed ICoPro algorithm successfully leverages both signals to achieve better-aligned policies and improved sample efficiency compared to using either signal in isolation.
This research contributes to the field of RLHF by proposing a practical framework for incorporating both imperfect reward signals and human feedback. The findings have implications for developing more robust and aligned RL agents in real-world applications where defining perfect reward functions is challenging.