Core Concepts
Contrastive Preference Learning (CPL) is a new framework for learning optimal policies directly from human preferences without the need for reinforcement learning. CPL leverages the regret-based model of human preferences to derive a simple supervised learning objective that converges to the optimal policy.
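A minimal sketch of the regret-based preference model referenced here, in the paper's notation (σ denotes a behavior segment of state-action pairs and A* the optimal advantage function under the expert's reward; summing A* over a segment equals its negated regret):

```latex
P(\sigma^+ \succ \sigma^-) =
\frac{\exp \sum_{\sigma^+} \gamma^t A^*(s_t^+, a_t^+)}
     {\exp \sum_{\sigma^+} \gamma^t A^*(s_t^+, a_t^+) + \exp \sum_{\sigma^-} \gamma^t A^*(s_t^-, a_t^-)}
```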
Summary
The paper introduces Contrastive Preference Learning (CPL), a new framework for learning optimal policies from human preferences without using reinforcement learning (RL).
Key insights:
- Existing RLHF methods assume human preferences are distributed according to the discounted sum of rewards, but recent work shows they are better modeled by the regret under the optimal policy.
- Learning a reward function and then optimizing it with RL leads to significant optimization challenges, limiting the scalability of RLHF methods.
- CPL directly learns the optimal policy by exploiting the bijection between the optimal advantage function and the optimal policy in the maximum entropy RL framework.
- CPL uses a contrastive objective that compares the log-likelihood of preferred and non-preferred behavior segments, circumventing the need for RL (the objective is sketched after this list).
- Theoretically, CPL is shown to converge to the optimal policy under the regret-based preference model.
- Empirically, CPL outperforms RL-based baselines on high-dimensional continuous control tasks, while being simpler and more computationally efficient.
- CPL can also effectively learn from limited real human preference data.
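Concretely, maximum-entropy RL gives the bijection A*(s, a) = α log π*(a | s), where α is the temperature. Substituting α log π_θ for A* in the regret-based preference model sketched above yields the purely supervised, contrastive objective below; this is a sketch following the paper's notation, not a full statement of every variant:

```latex
\mathcal{L}_{\mathrm{CPL}}(\pi_\theta, \mathcal{D}_{\mathrm{pref}}) =
\mathbb{E}_{(\sigma^+, \sigma^-) \sim \mathcal{D}_{\mathrm{pref}}}\left[
- \log
\frac{\exp \sum_{\sigma^+} \gamma^t \alpha \log \pi_\theta(a_t^+ \mid s_t^+)}
     {\exp \sum_{\sigma^+} \gamma^t \alpha \log \pi_\theta(a_t^+ \mid s_t^+)
      + \exp \sum_{\sigma^-} \gamma^t \alpha \log \pi_\theta(a_t^- \mid s_t^-)}
\right]
```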
The key benefits of CPL are that it: 1) scales well because it uses only supervised learning objectives, 2) is fully off-policy, and 3) can be applied to general MDPs, unlike prior RLHF methods.
Statistics
The expert's reward function r_E is not observed and must be inferred from human preferences.
The preference dataset D_pref contains pairs of behavior segments (σ+, σ-), where σ+ was preferred over σ-.
The authors assume human preferences are distributed according to the regret of each segment under the optimal policy, rather than according to its discounted sum of rewards.
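A minimal, hypothetical sketch of the contrastive loss over D_pref (names such as cpl_loss and logp_pos are illustrative, not the authors' implementation; it assumes PyTorch and uses the identity -log[e^x / (e^x + e^y)] = softplus(y - x)):

```python
# Sketch only: assumes a policy has already produced per-timestep log-probabilities
# log pi_theta(a_t | s_t) for each segment in a batch of preference pairs.
import torch
import torch.nn.functional as F

def cpl_loss(logp_pos, logp_neg, alpha=0.1, gamma=1.0):
    """Contrastive preference loss over (sigma+, sigma-) pairs.

    logp_pos, logp_neg: (batch, segment_len) tensors of log pi_theta(a_t | s_t)
    for the preferred and non-preferred segments, respectively.
    alpha: maximum-entropy temperature; gamma: discount factor (illustrative defaults).
    """
    t = torch.arange(logp_pos.shape[1], dtype=logp_pos.dtype)
    discount = gamma ** t                                  # gamma^t per timestep
    score_pos = alpha * (discount * logp_pos).sum(dim=1)   # proxy for summed A* over sigma+
    score_neg = alpha * (discount * logp_neg).sum(dim=1)   # proxy for summed A* over sigma-
    # -log [exp(s+) / (exp(s+) + exp(s-))] == softplus(s- - s+)
    return F.softplus(score_neg - score_pos).mean()

# Usage example with random segment log-probabilities:
if __name__ == "__main__":
    batch, seg_len = 32, 16
    logp_pos = torch.randn(batch, seg_len)
    logp_neg = torch.randn(batch, seg_len)
    print(cpl_loss(logp_pos, logp_neg).item())
```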
Quotes
"Recent work (Knox et al., 2022) calls this into question, positing that humans instead provide preferences based on the regret of each behavior under the optimal policy of the expert's reward function."
"Intuitively, a human's judgement is likely based on optimality, instead of which states and actions have higher quantity for reward."
"CPL has three key benefits over prior work. First, CPL can scale as well as supervised learning because it uses only supervised objectives to match the optimal advantage without any policy gradients or dynamic programming. Second, CPL is fully off-policy, enabling effectively using any offline sub-optimal data source. Finally, CPL can be applied to arbitrary Markov Decision Processes (MDPs), allowing for learning from preference queries over sequential data."