Best-of-n sampling is an essentially optimal strategy for aligning large language models to human preferences, and the BoNBoN alignment method trains LLMs to mimic the best-of-n sampling distribution directly, achieving high win rates with minimal negative impact on off-target attributes.
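The distribution being mimicked is the one induced by the selection procedure sketched below; this is a minimal sketch of best-of-n sampling itself, not BoNBoN's training method, and `generate` and `reward` are hypothetical callables standing in for an LLM sampler and a reward model.

```python
# Illustrative best-of-n sampling sketch; `generate` and `reward` are hypothetical
# callables (LLM sampler and reward model), not part of BoNBoN's training code.
def best_of_n(prompt, n, generate, reward):
    """Draw n candidate responses and return the one the reward model scores highest."""
    candidates = [generate(prompt) for _ in range(n)]
    return max(candidates, key=lambda response: reward(prompt, response))
```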
This paper introduces Evolving Alignment via Asymmetric Self-Play (eva), a novel framework for aligning large language models (LLMs) that improves upon traditional RLHF by dynamically evolving the prompt distribution during training, leading to more efficient and generalizable models.
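As a rough illustration of the asymmetric self-play idea, one can picture a loop in which a creator keeps the prompts that produce large reward gaps among sampled responses and evolves new prompts from them, while the solver is preference-trained on the result. The sketch below is a loose paraphrase under that assumption, with hypothetical helper callables; it is not eva's actual algorithm.

```python
# Loose sketch of an evolving-prompt alignment loop (illustrative only; eva's actual
# prompt scoring, evolution, and training steps are defined in the paper).
def evolve_prompt_pool(policy, prompt_pool, sample_responses, reward, evolve_prompt, n=4):
    """Score prompts by the reward gap among sampled responses and evolve the most informative ones."""
    scored = []
    for prompt in prompt_pool:
        rewards = [reward(prompt, r) for r in sample_responses(policy, prompt, n)]
        scored.append((max(rewards) - min(rewards), prompt))  # larger gap = more informative
    scored.sort(key=lambda pair: pair[0], reverse=True)
    informative = [prompt for _, prompt in scored[: max(1, len(scored) // 2)]]
    evolved = [evolve_prompt(policy, prompt) for prompt in informative]
    return prompt_pool + evolved  # the solver is then preference-trained on this evolved pool
```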
This research introduces a novel method for aligning large language models (LLMs) at inference time, enabling users to dynamically control the proficiency level of generated responses across single and multiple domains using Alignment Vectors (AVs) derived from model editing techniques.
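One plausible reading, stated here as an assumption rather than a confirmed detail of the paper, is that an AV behaves like a task-arithmetic weight delta that can be scaled at inference time to dial proficiency; a minimal sketch under that assumption:

```python
# Hedged sketch: an Alignment Vector treated as a weight delta between an aligned model
# and its base model, scaled at inference time. Illustrative assumption, not the paper's
# stated extraction recipe.
def extract_alignment_vector(base_state, aligned_state):
    """Per-parameter difference between the aligned model's weights and the base model's."""
    return {name: aligned_state[name] - base_state[name] for name in base_state}

def apply_alignment_vector(base_state, alignment_vector, strength=1.0):
    """Blend the AV into the base weights; `strength` dials the proficiency level."""
    return {name: base_state[name] + strength * alignment_vector[name] for name in base_state}
```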
α-DPO is a new preference optimization algorithm that improves large language model alignment by introducing a dynamic reward margin, addressing limitations of DPO and SimPO and outperforming baseline models on benchmarks such as AlpacaEval 2 and Arena-Hard.
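For context, the margin-augmented DPO objective that this family of methods builds on can be sketched as below, where γ(x, y_w, y_l) is a placeholder for the margin term; α-DPO's contribution is making that margin dynamic, and its exact form is given in the paper (this is a sketch, not the paper's formula):

$$
\mathcal{L}(\theta) = -\,\mathbb{E}_{(x,\,y_w,\,y_l)}\!\left[\log \sigma\!\left(\beta \log \frac{\pi_\theta(y_w \mid x)}{\pi_{\mathrm{ref}}(y_w \mid x)} \;-\; \beta \log \frac{\pi_\theta(y_l \mid x)}{\pi_{\mathrm{ref}}(y_l \mid x)} \;-\; \gamma(x, y_w, y_l)\right)\right]
$$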
SparsePO improves the alignment of large language models with human preferences by selectively weighting the importance of individual tokens when computing rewards and KL divergence during preference optimization, leading to better performance on helpfulness, code-generation, and summarization tasks.
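A minimal sketch of token-level weighting inside a DPO-style pairwise loss follows; the mask values are taken as given inputs and the loss is illustrative of the idea, not SparsePO's exact objective or its mask-learning procedure.

```python
import math

# Illustrative token-weighted pairwise preference loss (not SparsePO's exact objective).
def masked_logratio(policy_logprobs, ref_logprobs, token_mask):
    """Sum per-token (policy - reference) log-probs, each weighted by a sparse mask value."""
    return sum(m * (lp - lr) for lp, lr, m in zip(policy_logprobs, ref_logprobs, token_mask))

def pairwise_loss(chosen_logratio, rejected_logratio, beta=0.1):
    """DPO-style loss on masked sequence log-ratios; masked-out tokens contribute nothing."""
    margin = beta * (chosen_logratio - rejected_logratio)
    return math.log1p(math.exp(-margin))  # equals -log(sigmoid(margin))
```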
Aligning large language models (LLMs) with human preferences is more effective when training uses online data and the learned LLM is constrained to stay close to the LLM that generated that training data.
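That recommendation corresponds to the standard KL-regularized objective below, written here as a sketch rather than a formula taken from the paper, with reward model r, regularization coefficient β, and π_gen denoting the policy that generated the training data:

$$
\max_{\theta}\; \mathbb{E}_{x \sim \mathcal{D},\; y \sim \pi_\theta(\cdot \mid x)}\big[r(x, y)\big] \;-\; \beta\, \mathbb{E}_{x \sim \mathcal{D}}\Big[\mathrm{KL}\big(\pi_\theta(\cdot \mid x)\,\big\|\,\pi_{\mathrm{gen}}(\cdot \mid x)\big)\Big]
$$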
This research paper introduces INPO, a novel online algorithm leveraging no-regret learning to align large language models with general human preferences, achieving superior performance compared to existing online RLHF methods.
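To illustrate the no-regret flavor, a generic multiplicative-weights / mirror-descent update over policies in a preference game can be written as below, with learning rate η and preference probability P(y ≻ y' | x); this is an illustrative update of that general family, not INPO's actual loss, which the paper derives differently:

$$
\pi_{t+1}(y \mid x) \;\propto\; \pi_t(y \mid x)\,\exp\!\Big(\eta\, \mathbb{E}_{y' \sim \pi_t(\cdot \mid x)}\big[\mathbb{P}(y \succ y' \mid x)\big]\Big)
$$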