Core Concepts
Approaches that use on-policy sampling or attempt to push down the likelihood on certain responses (i.e., employ a "negative gradient") generally outperform offline and maximum likelihood objectives for preference fine-tuning of large language models.
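To make the "negative gradient" idea concrete, here is a minimal sketch of a DPO-style contrastive loss in PyTorch. The gradient of this loss raises the log-likelihood of the preferred response and explicitly lowers it for the dispreferred one. The tensor names, the toy values, and the assumption that per-response log-probabilities are already summed over tokens are illustrative, not taken from the paper.

```python
import torch
import torch.nn.functional as F

def dpo_loss(policy_logp_chosen, policy_logp_rejected,
             ref_logp_chosen, ref_logp_rejected, beta=0.1):
    """DPO-style contrastive loss on summed per-response log-probabilities.

    Its gradient pushes up log pi(chosen) and pushes down log pi(rejected):
    the 'negative gradient' on dispreferred responses.
    """
    # Implicit reward margins relative to the reference policy.
    chosen_margin = policy_logp_chosen - ref_logp_chosen
    rejected_margin = policy_logp_rejected - ref_logp_rejected
    # -log sigmoid(beta * (chosen margin - rejected margin))
    return -F.logsigmoid(beta * (chosen_margin - rejected_margin)).mean()

# Toy usage with made-up per-response log-probabilities (batch of 2).
policy_logp_chosen = torch.tensor([-12.0, -9.5], requires_grad=True)
policy_logp_rejected = torch.tensor([-11.0, -10.0], requires_grad=True)
ref_logp_chosen = torch.tensor([-12.5, -9.8])
ref_logp_rejected = torch.tensor([-10.5, -9.9])

loss = dpo_loss(policy_logp_chosen, policy_logp_rejected,
                ref_logp_chosen, ref_logp_rejected)
loss.backward()
# Gradient signs: negative for chosen (its likelihood rises under gradient
# descent), positive for rejected (its likelihood is pushed down).
print(policy_logp_chosen.grad, policy_logp_rejected.grad)
```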
Abstract
The paper investigates different approaches for fine-tuning large language models (LLMs) on preference data, with the goal of aligning the model's responses with human preferences. The authors analyze a range of fine-tuning methods, including supervised learning, on-policy reinforcement learning (RL), and contrastive learning, and study how their behavior depends on the reward function, the reference policy, and the coverage of the preference data.
The key findings are:
On-policy sampling generally improves performance and efficiency, especially when the peak of the reward function lies in regions that are unlikely under the reference policy. A moderate amount of sample reuse can reduce the reliance on fresh on-policy samples, but excessive reuse can hurt exploration (a minimal sketch of this recipe appears after these findings).
Approaches that use on-policy sampling or attempt to push down the likelihood of certain responses (i.e., employ a "negative gradient") outperform offline and maximum likelihood objectives. These "mode-seeking" objectives can relocate probability mass across the bins of a categorical distribution much faster than "mode-covering" maximum likelihood approaches, which reduce mass on other responses only indirectly through normalization (see the second sketch after these findings).
The performance of different fine-tuning methods is tied to the geometric alignment between the ground-truth reward function and the reference policy, as well as the coverage of the preference data relative to the reference policy. When the peak of the reward function lies in high-likelihood regions of the reference policy, offline supervised methods can work well without the need for on-policy sampling or negative gradients.
The authors provide actionable insights for practitioners on choosing the appropriate fine-tuning approach based on the characteristics of the problem and the preference data.
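As a heavily simplified picture of the on-policy finding above, the sketch below runs a REINFORCE-style loop on a toy 10-way categorical bandit: each outer iteration draws fresh samples from the current policy, and each batch is reused for a few gradient steps before being discarded. The bandit, reward vector, and hyperparameters are illustrative assumptions, not the paper's experimental setup.

```python
import torch

torch.manual_seed(0)

# Toy categorical "bandit" standing in for a response distribution:
# the policy is a softmax over 10 discrete responses, and the reward
# peaks on a single response.
num_bins = 10
reward = torch.zeros(num_bins)
reward[7] = 1.0

logits = torch.zeros(num_bins, requires_grad=True)
opt = torch.optim.Adam([logits], lr=0.05)

batch_size = 64
reuse_steps = 4   # modest reuse of each on-policy batch

for _ in range(100):
    # On-policy sampling: draw responses from the *current* policy.
    with torch.no_grad():
        probs = torch.softmax(logits, dim=0)
        actions = torch.multinomial(probs, batch_size, replacement=True)
        advantage = reward[actions] - reward[actions].mean()

    # Limited sample reuse: a few gradient steps on this batch, then resample.
    # (Later steps on the same batch are slightly off-policy; heavy reuse
    # would drift further from the sampling distribution.)
    for _ in range(reuse_steps):
        logp = torch.log_softmax(logits, dim=0)[actions]
        loss = -(advantage * logp).mean()   # REINFORCE-style objective
        opt.zero_grad()
        loss.backward()
        opt.step()

print(torch.softmax(logits.detach(), dim=0))  # mass should concentrate on the rewarded bin
```

To illustrate the "relocating probability mass across bins" claim, the second sketch compares plain maximum likelihood on a preferred bin against the same objective plus an explicit negative gradient on a dispreferred bin, over a 10-bin categorical distribution. The setup, step count, and learning rate are assumptions chosen for illustration; it mirrors the intuition in the finding above rather than reproducing the paper's exact didactic experiment.

```python
import torch

def train(loss_fn, steps=50, lr=0.1):
    # 10-bin categorical policy, uniform at initialization.
    logits = torch.zeros(10, requires_grad=True)
    opt = torch.optim.SGD([logits], lr=lr)
    for _ in range(steps):
        loss = loss_fn(torch.log_softmax(logits, dim=0))
        opt.zero_grad()
        loss.backward()
        opt.step()
    return torch.softmax(logits.detach(), dim=0)

PREFERRED, DISPREFERRED = 7, 3

# Maximum likelihood on the preferred bin only: other bins lose mass
# solely through the softmax normalization.
ml = train(lambda logp: -logp[PREFERRED])

# Same objective plus an explicit negative gradient that directly pushes
# down the dispreferred bin (a contrastive-style term).
neg = train(lambda logp: -(logp[PREFERRED] - logp[DISPREFERRED]))

print(f"max-likelihood:    p(pref)={ml[PREFERRED].item():.3f}  p(dispref)={ml[DISPREFERRED].item():.3f}")
print(f"negative gradient: p(pref)={neg[PREFERRED].item():.3f}  p(dispref)={neg[DISPREFERRED].item():.3f}")
# After the same number of steps, the negative-gradient run has moved mass
# off the dispreferred bin (and onto the preferred one) much faster.
```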
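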
Stats
"When the peak of the reward function lies in the less likely regions of the reference policy, on-policy sampling is generally beneficial."
"An explicit negative gradient approach (e.g., via RL objectives or via contrastive objectives) is beneficial when the preference data is skewed away from the reference policy."
Quotes
"Approaches that use on-policy sampling or attempt to push down the likelihood on certain responses (i.e., employ a 'negative gradient') generally outperform offline and maximum likelihood objectives."
"Mode-seeking objectives are able to alter probability mass on specific bins of a categorical distribution at a fast rate compared to maximum likelihood, allowing them to relocate masses across bins more effectively."