This paper proposes distilled Self-Critique (dSC), a Bayesian inference-based approach to refine the outputs of large language models using only synthetic data. dSC incorporates a reward model as a likelihood and utilizes a Gibbs MCMC sampler to iteratively critique and revise model responses, followed by a self-distillation step.
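A minimal sketch of the critique-and-revise loop, assuming hypothetical `generate` (LLM call) and `reward` (reward model) helpers; the acceptance rule shown (keep a revision only if it scores at least as well) is an illustrative simplification of treating the reward as a likelihood, and the subsequent self-distillation step (fine-tuning on the accepted samples) is omitted:

```python
def dsc_refine(prompt, generate, reward, steps=4):
    """Gibbs-style refinement: alternate sampling a critique given the current
    response and a revision given the critique, gating each revision with the
    reward model."""
    response = generate(f"Answer the question.\n{prompt}")
    for _ in range(steps):
        critique = generate(
            f"Question: {prompt}\nAnswer: {response}\n"
            "Critique this answer, pointing out any errors."
        )
        revision = generate(
            f"Question: {prompt}\nAnswer: {response}\nCritique: {critique}\n"
            "Rewrite the answer, fixing the issues raised in the critique."
        )
        # Reward model acts like a likelihood: keep higher-reward revisions.
        if reward(prompt, revision) >= reward(prompt, response):
            response = revision
    return response
```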
SALMON introduces an instructable reward model that can generate reward scores based on arbitrary human-defined principles, enabling the alignment of large language models with minimal human supervision.
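A rough sketch of how principle-conditioned reward scoring could work in practice; the prompt template, `judge` callable, and numeric-score parsing below are illustrative assumptions, not SALMON's actual implementation:

```python
def principle_reward(judge, prompt, response, principles):
    """Score a response against arbitrary human-defined principles by asking a
    judge/reward model to rate adherence to each one (assumes the judge replies
    with a bare number)."""
    scores = []
    for principle in principles:
        query = (
            f"Principle: {principle}\n"
            f"User prompt: {prompt}\nAssistant response: {response}\n"
            "On a scale of 1-10, how well does the response follow the principle? "
            "Reply with a single number."
        )
        scores.append(float(judge(query).strip()))
    return sum(scores) / len(scores)

# Principles can be swapped in or out at alignment time without retraining:
principles = ["Be honest about uncertainty.", "Refuse unsafe instructions."]
```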
The DPO loss decreases the probability of producing human-dispreferred responses faster than it increases the probability of producing preferred ones; this asymmetry explains why DPO can hinder an LLM's capacity to learn to generate human-preferred responses and why it is sensitive to the effectiveness of supervised fine-tuning (SFT).
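For reference, the standard DPO objective under analysis (policy $\pi_\theta$, reference $\pi_{\mathrm{ref}}$, preferred response $y_w$, dispreferred response $y_l$, temperature $\beta$) is:

```latex
\mathcal{L}_{\mathrm{DPO}}(\theta) = -\,\mathbb{E}_{(x, y_w, y_l)}\!\left[
  \log \sigma\!\left(
    \beta \log \frac{\pi_\theta(y_w \mid x)}{\pi_{\mathrm{ref}}(y_w \mid x)}
    - \beta \log \frac{\pi_\theta(y_l \mid x)}{\pi_{\mathrm{ref}}(y_l \mid x)}
  \right)\right]
```

The paper's claim concerns how the gradient of this loss allocates its magnitude between pushing up $\pi_\theta(y_w \mid x)$ and pushing down $\pi_\theta(y_l \mid x)$.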
This paper provides a theoretical characterization of the optimal solution to the KL-constrained reinforcement learning (RL) problem for language model alignment and establishes an asymptotic equivalence between this optimal solution and the simpler best-of-N alignment method.
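The two objects being compared are standard: the KL-constrained objective $\mathbb{E}_{\pi}[r] - \beta\,\mathrm{KL}(\pi \,\|\, \pi_{\mathrm{ref}})$ has a well-known closed-form maximizer, while best-of-$N$ draws $N$ samples from the reference model and keeps the highest-reward one:

```latex
\pi^*(y \mid x) \;\propto\; \pi_{\mathrm{ref}}(y \mid x)\,\exp\!\big(r(x, y)/\beta\big),
\qquad
y_{\mathrm{BoN}} = \operatorname*{arg\,max}_{y \in \{y_1, \dots, y_N\}} r(x, y),
\quad y_i \overset{\text{i.i.d.}}{\sim} \pi_{\mathrm{ref}}(\cdot \mid x).
```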
This paper proposes RS-DPO, a method that systematically combines rejection sampling (RS) and direct preference optimization (DPO) to efficiently fine-tune large language models (LLMs) with human feedback, outperforming existing methods such as PPO and DPO.
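A hedged sketch of the rejection-sampling step, assuming hypothetical `generate_k` and `reward` helpers: sample several responses per prompt from the SFT model, score them with a reward model, and keep contrastive pairs whose reward gap exceeds a threshold as DPO training data (pairing only the best against the worst is a simplification):

```python
def build_dpo_pairs(prompts, generate_k, reward, k=8, gap_threshold=1.0):
    """Rejection sampling for preference pairs: keep a (chosen, rejected) pair
    only when the reward-model gap is large enough to trust the preference."""
    pairs = []
    for prompt in prompts:
        candidates = generate_k(prompt, k)                     # k samples from the SFT model
        scored = sorted(((reward(prompt, c), c) for c in candidates), reverse=True)
        (best_r, best), (worst_r, worst) = scored[0], scored[-1]
        if best_r - worst_r >= gap_threshold:                  # confident preferences only
            pairs.append({"prompt": prompt, "chosen": best, "rejected": worst})
    return pairs  # then run standard DPO on these pairs
```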
Regularized Best-of-N (RBoN) sampling mitigates reward hacking in language model alignment by incorporating proximity regularization into the Best-of-N sampling approach.
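A minimal sketch of the regularized selection rule, assuming a hypothetical `proximity_penalty` (e.g., the candidate's negative log-likelihood under the reference model) standing in for RBoN's proximity regularizer:

```python
def regularized_best_of_n(prompt, candidates, reward, proximity_penalty, lam=0.1):
    """Pick the candidate maximizing reward minus a proximity penalty, so the
    selection cannot drift arbitrarily far from the reference model just to
    exploit reward-model errors (reward hacking)."""
    def objective(y):
        return reward(prompt, y) - lam * proximity_penalty(prompt, y)
    return max(candidates, key=objective)
```

Setting `lam=0` recovers plain Best-of-N; larger values trade raw reward for staying close to the reference distribution.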
Incorporating prior constraints on length ratio and cosine similarity during reward model training can effectively regulate the optimization magnitude and control the score margins, leading to improved alignment of large language models.
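One way to read these prior constraints is as a margin in the pairwise ranking loss that depends on surface statistics of each pair; the sketch below is an assumption-laden illustration (the margin function, the weights `alpha`/`beta`, and the specific use of length ratio and embedding cosine similarity are hypothetical), not the paper's exact formulation:

```python
import torch
import torch.nn.functional as F

def pc_ranking_loss(r_chosen, r_rejected, len_ratio, cos_sim, alpha=0.5, beta=0.5):
    """Pairwise ranking loss with a prior-constraint margin: the required score
    margin grows with the chosen/rejected length ratio (discouraging a pure
    length bias) and shrinks as the two responses become more similar
    (near-duplicate pairs are not forced far apart)."""
    len_ratio = torch.as_tensor(len_ratio, dtype=torch.float32)
    cos_sim = torch.as_tensor(cos_sim, dtype=torch.float32)
    margin = alpha * torch.log(len_ratio) + beta * (1.0 - cos_sim)  # hypothetical margin
    return -F.logsigmoid(r_chosen - r_rejected - margin).mean()
```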
Acquiring preferences jointly over instruction-response pairs can significantly enhance the alignment of large language models by eliciting human preferences over a broader space of comparisons than responses to a single fixed instruction.
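One plausible way to cast this is a DPO-style objective over whole instruction-response pairs, so that $(x_w, y_w)$ can be compared against $(x_l, y_l)$ even when $x_w \neq x_l$; the exact objective shown here is an illustrative assumption:

```latex
\mathcal{L}(\theta) = -\,\mathbb{E}_{(x_w, y_w, x_l, y_l)}\!\left[
  \log \sigma\!\left(
    \beta \log \frac{\pi_\theta(y_w \mid x_w)}{\pi_{\mathrm{ref}}(y_w \mid x_w)}
    - \beta \log \frac{\pi_\theta(y_l \mid x_l)}{\pi_{\mathrm{ref}}(y_l \mid x_l)}
  \right)\right]
```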
Mixed Preference Optimization (MPO) is a novel method that combines the strengths of Reinforcement Learning with Human Feedback (RLHF) and Direct Preference Optimization (DPO) to effectively align large language models with human values, while mitigating the weaknesses of both approaches.