Contrastive Preference Learning: Optimizing Policies Directly from Human Feedback without Reinforcement Learning
Contrastive Preference Learning (CPL) is a new framework for learning optimal policies directly from human preferences, without the need for reinforcement learning. CPL builds on the regret-based model of human preferences, under which a person prefers the behavior segment with the lower regret (equivalently, the higher advantage) under their intended policy, rather than the one with the higher sum of rewards. From this model, CPL derives a simple supervised contrastive objective whose optimum is the optimal policy.
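As a concrete illustration, below is a minimal PyTorch sketch of a contrastive, Bradley-Terry-style objective in this spirit: the policy's summed log-probabilities over a segment stand in for the (maximum-entropy) advantage, and the preferred segment is scored against the dispreferred one with a logistic loss. The function name `cpl_style_loss`, the tensor shapes, and the `alpha` (temperature) and `bias` (down-weighting of the rejected segment) parameters are illustrative assumptions, not a reference implementation; per-timestep discounting is omitted for brevity.

```python
import torch
import torch.nn.functional as F

def cpl_style_loss(logp_chosen, logp_rejected, alpha=0.1, bias=1.0):
    """Contrastive preference loss over a pair of behavior segments.

    logp_chosen / logp_rejected: (batch, T) tensors of the policy's
    log-probabilities log pi(a_t | s_t) along the preferred and
    dispreferred segments. Under a regret-based preference model,
    alpha * log pi plays the role of the advantage, so the loss is a
    logistic (Bradley-Terry) comparison of summed segment advantages.
    """
    adv_chosen = alpha * logp_chosen.sum(dim=-1)
    adv_rejected = alpha * logp_rejected.sum(dim=-1)
    # bias < 1 would down-weight the rejected segment (a conservative
    # variant); bias = 1 gives the plain contrastive objective.
    logits = adv_chosen - bias * adv_rejected
    # -log sigmoid(x) equals the cross-entropy of the softmax over the
    # two segments, i.e. -log( e^{a+} / (e^{a+} + e^{a-}) ).
    return -F.logsigmoid(logits).mean()

# Illustrative usage: a batch of 8 segment pairs, 64 timesteps each.
logp_c = -F.softplus(torch.randn(8, 64, requires_grad=True))  # log-probs <= 0
logp_r = -F.softplus(torch.randn(8, 64, requires_grad=True))
loss = cpl_style_loss(logp_c, logp_r)
loss.backward()  # gradients would update the policy producing these log-probs
```

Because the objective depends only on the policy's own log-probabilities over logged segments, it can be minimized with ordinary supervised training on a fixed preference dataset, with no reward model and no RL rollout loop.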