Core Concepts
Regularized Best-of-N (RBoN) sampling is proposed as a method to mitigate reward hacking in language model alignment by incorporating proximity regularization into Best-of-N sampling.
Abstract
The content discusses the challenge of aligning the behavior of large language models (LLMs) with human preferences, and introduces Regularized Best-of-N (RBoN) sampling as a method to address the reward hacking problem.
The key highlights are:
Preference learning methods like Reinforcement Learning from Human Feedback (RLHF) and Direct Preference Optimization (DPO) are susceptible to the reward hacking problem, where optimizing the proxy reward model does not necessarily optimize the true intended objective.
Best-of-N (BoN) sampling is a popular decoding-time alignment method, but it is also vulnerable to reward hacking.
RBoN is proposed as a variant of BoN that incorporates proximity regularization, similar to the KL divergence term used in RLHF and DPO, to mitigate reward hacking.
Two variants of RBoN are introduced: RBoN_KL, which uses KL divergence as the proximity regularizer, and RBoN_WD, which uses Wasserstein distance (see the selection-rule sketch after this list).
Experiments on the AlpacaFarm dataset show that RBoN outperforms vanilla BoN, especially when the proxy reward model is loosely correlated with the true objective.
RBoN_WD is also evaluated for generating a pairwise preference dataset for DPO, and the resulting DPO model outperforms one trained on a dataset generated by vanilla BoN (a sketch of such pair construction follows below).
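A minimal sketch of the three selection rules discussed above, assuming RBoN_KL adds a term proportional to the candidate's log-probability under the reference policy and RBoN_WD adds an MBR-style average similarity to the sampled candidates; the reward, ref_logprob, and similarity functions here are toy stand-ins, and the exact objectives are those defined in the paper.

```python
from typing import Callable, List

def best_of_n(candidates: List[str],
              reward: Callable[[str], float]) -> str:
    """Vanilla BoN: return the candidate with the highest proxy reward."""
    return max(candidates, key=reward)

def rbon_kl(candidates: List[str],
            reward: Callable[[str], float],
            ref_logprob: Callable[[str], float],
            beta: float = 0.1) -> str:
    """KL-style proximity term (assumed form): reward plus beta times the
    candidate's log-probability under the reference policy, penalizing
    responses the reference model finds implausible."""
    return max(candidates, key=lambda y: reward(y) + beta * ref_logprob(y))

def rbon_wd(candidates: List[str],
            reward: Callable[[str], float],
            similarity: Callable[[str, str], float],
            beta: float = 0.1) -> str:
    """Wasserstein-style proximity term (assumed form): reward plus beta
    times the average similarity to the sampled candidates, a Monte Carlo
    estimate of closeness to the reference distribution."""
    def score(y: str) -> float:
        prox = sum(similarity(y, other) for other in candidates) / len(candidates)
        return reward(y) + beta * prox
    return max(candidates, key=score)

if __name__ == "__main__":
    # Toy usage with stand-in reward / reference / similarity functions.
    cands = ["response A", "response B", "longer response C"]
    toy_reward = lambda y: float(len(y))        # stand-in proxy reward model
    toy_ref_lp = lambda y: -0.5 * len(y)        # stand-in reference log-prob
    toy_sim = lambda a, b: float(a[-1] == b[-1])  # stand-in similarity
    print(best_of_n(cands, toy_reward))
    print(rbon_kl(cands, toy_reward, toy_ref_lp, beta=1.0))
    print(rbon_wd(cands, toy_reward, toy_sim, beta=1.0))
```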
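A sketch of how RBoN_WD output could be turned into a DPO preference pair, assuming for illustration that the "chosen" response is the highest-scoring candidate under the regularized objective and the "rejected" response is the lowest-scoring one; the paper's actual pairing procedure may differ.

```python
from typing import Callable, Dict, List

def make_dpo_pair(prompt: str,
                  candidates: List[str],
                  reward: Callable[[str], float],
                  similarity: Callable[[str, str], float],
                  beta: float = 0.1) -> Dict[str, str]:
    """Build one DPO preference pair from N sampled candidates.

    Assumption: rank candidates by the RBoN_WD objective (reward plus a
    proximity term) and pair the best-ranked response against the
    worst-ranked one.
    """
    def score(y: str) -> float:
        prox = sum(similarity(y, other) for other in candidates) / len(candidates)
        return reward(y) + beta * prox

    ranked = sorted(candidates, key=score, reverse=True)
    return {"prompt": prompt, "chosen": ranked[0], "rejected": ranked[-1]}
```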
Stats
The content does not include specific metrics or figures supporting the author's key arguments.
Quotes
The content does not include notable quotes supporting the author's key arguments.