Contrastive Preference Learning: Optimizing Policies Directly from Human Feedback without Reinforcement Learning
Contrastive Preference Learning (CPL) is a new framework for learning optimal policies directly from human preferences, without the need for reinforcement learning. CPL builds on the regret-based model of human preferences, under which a person prefers the behavior segment with the lower regret (equivalently, the higher advantage) under their intended policy, rather than the one with the higher sum of rewards. From this model, CPL derives a simple supervised contrastive objective whose optimum is the optimal policy.
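As a concrete illustration, below is a minimal PyTorch sketch of a contrastive, Bradley-Terry-style objective in this spirit: the policy's summed log-probabilities over a segment stand in for the (maximum-entropy) advantage, and the preferred segment is scored against the dispreferred one with a logistic loss. The function name `cpl_style_loss`, the tensor shapes, and the `alpha` (temperature) and `bias` (down-weighting of the rejected segment) parameters are illustrative assumptions, not a reference implementation; per-timestep discounting is omitted for brevity.

```python
import torch
import torch.nn.functional as F

def cpl_style_loss(logp_chosen, logp_rejected, alpha=0.1, bias=1.0):
    """Contrastive preference loss over a pair of behavior segments.

    logp_chosen / logp_rejected: (batch, T) tensors of the policy's
    log-probabilities log pi(a_t | s_t) along the preferred and
    dispreferred segments. Under a regret-based preference model,
    alpha * log pi plays the role of the advantage, so the loss is a
    logistic (Bradley-Terry) comparison of summed segment advantages.
    """
    adv_chosen = alpha * logp_chosen.sum(dim=-1)
    adv_rejected = alpha * logp_rejected.sum(dim=-1)
    # bias < 1 would down-weight the rejected segment (a conservative
    # variant); bias = 1 gives the plain contrastive objective.
    logits = adv_chosen - bias * adv_rejected
    # -log sigmoid(x) equals the cross-entropy of the softmax over the
    # two segments, i.e. -log( e^{a+} / (e^{a+} + e^{a-}) ).
    return -F.logsigmoid(logits).mean()

# Illustrative usage: a batch of 8 segment pairs, 64 timesteps each.
logp_c = -F.softplus(torch.randn(8, 64, requires_grad=True))  # log-probs <= 0
logp_r = -F.softplus(torch.randn(8, 64, requires_grad=True))
loss = cpl_style_loss(logp_c, logp_r)
loss.backward()  # gradients would update the policy producing these log-probs
```

Because the objective depends only on the policy's own log-probabilities over logged segments, it can be minimized with ordinary supervised training on a fixed preference dataset, with no reward model and no RL rollout loop.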