Key Concepts
Large language models benefit from tools and natural language feedback in multi-turn interactions, but supervised instruction fine-tuning (SIFT) and reinforcement learning from human feedback (RLHF) can hurt their multi-turn performance.
Summary
The paper introduces MINT, a benchmark for evaluating large language models (LLMs) in multi-turn interactions using tools and natural language feedback. It highlights the importance of nuanced interactions between users, LLMs, and external tools, emphasizing the gap between research benchmarks and real-world use cases. An analysis of 20 LLMs reveals performance gains with tools and feedback, challenges the assumption that single-turn performance predicts multi-turn performance, and identifies negative impacts of certain training techniques on multi-turn capabilities.
The evaluation framework includes diverse datasets repurposed for efficient evaluation, showcasing intriguing findings across reasoning, coding, and decision-making tasks. Open-source LLMs generally lag behind closed-source models in multi-turn interaction performance despite benefiting from tools and feedback. The study also uncovers unexpected artifacts affecting model performance in specific scenarios.
Overall, MINT aims to track progress and incentivize research to enhance LLMs' capabilities in multi-turn interactions for real-world applications.
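The multi-turn evaluation described above can be sketched as a simple loop: the model may invoke a tool each turn, receives the tool's output or natural language feedback, and continues until it solves the task or exhausts its turn budget. This is a minimal illustrative sketch, not MINT's actual harness; the `model`, `execute_tool`, and `give_feedback` callables and the task format are all hypothetical stand-ins.

```python
def run_episode(task, model, execute_tool, give_feedback, max_turns=5):
    """Run one multi-turn episode: the model proposes a tool call or an
    answer each turn, observing tool output or feedback in between.
    (Hypothetical sketch; not the benchmark's real API.)"""
    history = [task["prompt"]]
    for turn in range(max_turns):
        action = model(history)  # model proposes a tool call or a final answer
        if action["type"] == "tool":
            observation = execute_tool(action["code"])
            history.append(observation)  # tool output fed back to the model
        elif action["answer"] == task["reference"]:
            return {"solved": True, "turns": turn + 1}
        else:
            history.append(give_feedback(action["answer"], task))
    return {"solved": False, "turns": max_turns}


# Toy stubs for illustration only.
task = {"prompt": "Compute 2 + 3.", "reference": "5"}

def toy_model(history):
    # First turn: use the tool; afterwards, answer with the tool's output.
    if len(history) == 1:
        return {"type": "tool", "code": "print(2 + 3)"}
    return {"type": "answer", "answer": history[-1].strip()}

def toy_tool(code):
    return "5\n"  # pretend sandboxed-execution output

def toy_feedback(answer, task):
    return f"{answer} is incorrect; try again."

result = run_episode(task, toy_model, toy_tool, toy_feedback)
# → {"solved": True, "turns": 2}
```

Per-turn tool use and the feedback step correspond directly to the two sources of gains reported in the statistics below.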
Statistics
Performance gains (absolute) of 1–8% for each turn of tool use.
Performance gains (absolute) of 2–17% with natural language feedback.
Supervised instruction fine-tuning (SIFT) hurts Codellama-34B's multi-turn performance by 11.1%.
Reinforcement learning from human feedback (RLHF) negatively affects LLaMA-2-70B by 8.7%.
Quotes
"LLMs generally benefit from tools and language feedback in multi-turn interactions."
"Better single-turn performance does not guarantee better multi-turn performance."
"SIFT and RLHF training techniques can hurt LLMs' capabilities in multi-turn settings."