Leveraging Suboptimal, On-Policy Data for Effective Preference Fine-Tuning of Large Language Models
Approaches that use on-policy sampling or explicitly push down the likelihood of certain responses (i.e., employ a "negative gradient") generally outperform offline, maximum-likelihood objectives for preference fine-tuning of large language models.
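To make the "negative gradient" concrete, below is a minimal sketch of one objective in this family, Direct Preference Optimization (DPO; Rafailov et al., 2023), whose gradient simultaneously raises the log-likelihood of the preferred response and lowers that of the dispreferred one. The function name, tensor shapes, and `beta` default are illustrative assumptions, not code from the paper.

```python
import torch
import torch.nn.functional as F

def dpo_loss(policy_chosen_logps: torch.Tensor,
             policy_rejected_logps: torch.Tensor,
             ref_chosen_logps: torch.Tensor,
             ref_rejected_logps: torch.Tensor,
             beta: float = 0.1) -> torch.Tensor:
    """DPO loss over a batch of preference pairs.

    Each argument is the per-example summed log-probability of the full
    response under the trainable policy or the frozen reference model.
    Minimizing this loss pushes the chosen log-ratio up and the rejected
    log-ratio down -- the "negative gradient" on dispreferred responses.
    """
    chosen_logratio = policy_chosen_logps - ref_chosen_logps
    rejected_logratio = policy_rejected_logps - ref_rejected_logps
    # -log sigmoid(beta * margin): minimized as chosen >> rejected.
    return -F.logsigmoid(beta * (chosen_logratio - rejected_logratio)).mean()

# Toy usage with random summed log-probs for a batch of 4 pairs.
pol_c, pol_r = torch.randn(4), torch.randn(4)
ref_c, ref_r = torch.randn(4), torch.randn(4)
loss = dpo_loss(pol_c, pol_r, ref_c, ref_r)
```

On-policy variants of this recipe would draw the chosen/rejected responses by sampling from the current policy during training rather than from a fixed offline dataset; the loss itself is unchanged.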