Theoretical Analysis of Nash Learning from Human Feedback under General KL-Regularized Preference
This work provides theoretical insights into a recently proposed learning paradigm, Nash learning from human feedback (NLHF), which considers a general preference model and formulates the alignment process as a game between two competing LLMs. The learning objective is to find a policy that consistently generates responses preferred over those of any competing policy while staying close to the initial model.
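To make this objective concrete, a minimal sketch of the KL-regularized preference game, following the standard NLHF formulation (the symbols here are assumptions, not taken from this abstract: $\mathcal{P}$ denotes the general preference model, $\pi_0$ the initial model, $\rho$ the prompt distribution, and $\tau > 0$ the regularization strength):

\[
\mathcal{P}_\tau(\pi \succ \pi') \;=\; \mathbb{E}_{x \sim \rho,\; y \sim \pi(\cdot \mid x),\; y' \sim \pi'(\cdot \mid x)}\bigl[\mathcal{P}(y \succ y' \mid x)\bigr] \;-\; \tau\,\mathrm{KL}(\pi \,\|\, \pi_0) \;+\; \tau\,\mathrm{KL}(\pi' \,\|\, \pi_0),
\]
\[
\pi^\star \;=\; \arg\max_{\pi}\,\min_{\pi'}\; \mathcal{P}_\tau(\pi \succ \pi').
\]

The two KL terms penalize each player symmetrically for drifting from the initial model, so the Nash policy $\pi^\star$ is one whose responses are preferred over those of any competitor while remaining anchored to $\pi_0$.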