Core concept
REFUEL is an efficient algorithm for training large language models on multi-turn tasks with RLHF. It addresses the covariate shift problem inherent in single-turn methods by training on on-policy data, and it uses a regression-based approach to predict relative future rewards.
Statistics
Llama-3-8B-it trained with REFUEL outperforms Llama-3.1-70B-it on long multi-turn dialogues.
Quotes
"REFUEL is a simple, regression-based approach for multi-turn RLHF."
"REFUEL is a multi-turn RL algorithm rather than a contextual bandit technique, allowing our approach to scale to tasks with meaningful stochastic dynamics like dialogue with a stochastic user."
"REFUEL is simpler than other approaches for multi-turn RLHF by avoiding an explicit critic network via a reparameterization trick."