Key Concepts
Learning from expert demonstrations can align LLMs more efficiently than preference-based learning.
Abstract
In this article, the author questions the efficacy of preference datasets in aligning Large Language Models (LLMs) and explores the use of expert demonstrations. Various approaches for aligning LLMs using demonstration datasets are introduced, drawing insights from inverse reinforcement learning and imitation learning. The analysis highlights different behaviors of alignment approaches and discusses the pros and cons of supervised fine-tuning. The article delves into Markov Decision Processes, online and offline RL, behavior cloning, imitation learning, and reinforcement learning from human feedback. It also explores different divergence minimization approaches in LLM alignment tasks.
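Behavior cloning, the simplest form of learning from demonstrations mentioned above, reduces to supervised fine-tuning: maximizing the log-likelihood of the expert's tokens under the model's policy. A minimal sketch of that objective, with a hypothetical three-token vocabulary and illustrative probabilities (none of these numbers come from the article):

```python
import math

def behavior_cloning_nll(policy_probs, demonstration):
    """Negative log-likelihood of an expert demonstration under the policy.

    Minimizing this loss is exactly the supervised fine-tuning objective:
    make the policy assign high probability to the expert's token choices.
    """
    nll = 0.0
    for step_probs, token in zip(policy_probs, demonstration):
        nll -= math.log(step_probs[token])
    return nll

# Policy's next-token distributions at each step (toy vocab = {0, 1, 2}).
policy_probs = [
    {0: 0.7, 1: 0.2, 2: 0.1},
    {0: 0.1, 1: 0.8, 2: 0.1},
]
demonstration = [0, 1]  # tokens the expert actually chose

loss = behavior_cloning_nll(policy_probs, demonstration)
print(round(loss, 4))  # lower loss means the policy tracks the expert more closely
```

This makes concrete why supervised fine-tuning is a form of imitation learning: no reward model or preference pairs are involved, only likelihood of demonstrations.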
Statistics
P(A ≻ B) = 1/2 + 1/2 · erf((S_A − S_B) / √(2(σ_A² + σ_B²)))
P(y_A ≻ y_B | x) = 1/2 + 1/2 · tanh((r_A − r_B) / √(2(v_A² + v_B²)))
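The two preference models above can be evaluated directly: the first is the probit (erf) form for comparing noisy scores, the second the logistic (tanh) form for comparing rewards. A minimal sketch, where the score, reward, and variance values passed in are illustrative placeholders, not numbers from the article:

```python
import math

def pref_prob_erf(s_a, s_b, var_a, var_b):
    """Probit model: P(A ≻ B) = 1/2 + 1/2 · erf((S_A - S_B) / sqrt(2(σ_A² + σ_B²)))."""
    return 0.5 + 0.5 * math.erf((s_a - s_b) / math.sqrt(2 * (var_a + var_b)))

def pref_prob_tanh(r_a, r_b, var_a, var_b):
    """Logistic model: P(y_A ≻ y_B | x) = 1/2 + 1/2 · tanh((r_A - r_B) / sqrt(2(v_A² + v_B²)))."""
    return 0.5 + 0.5 * math.tanh((r_a - r_b) / math.sqrt(2 * (var_a + var_b)))

# Both models return 1/2 when the scores are equal, and tend toward 1
# as the score gap grows relative to the combined noise.
p_probit = pref_prob_erf(1.2, 0.8, 0.5, 0.5)    # A slightly ahead of B
p_logistic = pref_prob_tanh(1.2, 0.8, 0.5, 0.5)
print(p_probit > 0.5, p_logistic > 0.5)
```

Both functions share the same structure: a score difference normalized by the combined uncertainty, squashed into a probability; only the noise model (Gaussian vs. logistic-like) differs.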
Quotes
"We argue that in LLM alignment, learning from demonstration can be more efficient than preference-based learning."
"Aligning LLMs with expert demonstrations can lead to better performance than traditional methods like supervised fine-tuning."