Mitigating Reward Hacking in Language Model Alignment through Regularized Best-of-N Sampling
Regularized Best-of-N (RBoN) sampling is proposed to mitigate reward hacking in language model alignment by incorporating a proximity regularization term into Best-of-N sampling, so that candidates are selected not only for high proxy reward but also for staying close to the reference policy.
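To make the selection rule concrete, below is a minimal Python sketch of a KL-style variant of this idea, assuming two hypothetical helpers: `reward_fn(prompt, y)` returning a scalar proxy-reward score and `ref_logprob_fn(prompt, y)` returning the candidate's log-probability under the reference policy; the regularization weight `beta` is likewise illustrative.

```python
def rbon_select(prompt, candidates, reward_fn, ref_logprob_fn, beta=0.1):
    """Select the candidate maximizing reward plus a proximity bonus.

    Hypothetical helpers (assumptions, not the paper's API):
      reward_fn(prompt, y)      -> scalar score from the proxy reward model
      ref_logprob_fn(prompt, y) -> log-probability of y under the
                                   reference (pre-alignment) policy

    The beta * log pi_ref(y | prompt) term penalizes candidates the
    reference policy finds unlikely, discouraging outputs that exploit
    flaws in the proxy reward (reward hacking). Plain Best-of-N is the
    special case beta = 0.
    """
    def regularized_score(y):
        return reward_fn(prompt, y) + beta * ref_logprob_fn(prompt, y)

    return max(candidates, key=regularized_score)
```

Larger `beta` pulls the selection toward outputs the reference policy already considers plausible, trading some proxy reward for robustness against reward-model misspecification.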