HaluEval-Wild introduces a benchmark for evaluating LLM hallucinations in real-world settings. It collects challenging user queries from sources such as ShareGPT, categorizes them into five query types, and synthesizes reference answers using GPT-4 with retrieval-augmented generation (RAG). The benchmark highlights the nuanced challenge of balancing model performance with reliability, especially in knowledge-distilled models. Evaluating various LLMs on the benchmark reveals clear differences in their hallucination rates. The study emphasizes the importance of understanding and improving LLM reliability in dynamic user interactions.
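To make the evaluation pipeline concrete, here is a minimal sketch of how a per-category hallucination rate could be computed against the synthesized reference answers. The function names (`generate_response`, `judge_hallucination`) and the record schema are illustrative assumptions, not the paper's actual code.

```python
# Minimal sketch of a hallucination-rate evaluation in the spirit of
# HaluEval-Wild. The callables and record fields below are assumed
# placeholders, not the authors' implementation.
from collections import defaultdict

def hallucination_rate(benchmark, generate_response, judge_hallucination):
    """Estimate per-category hallucination rates.

    benchmark: iterable of dicts with keys 'query', 'category',
               and 'reference' (a GPT-4/RAG-synthesized answer).
    generate_response(query) -> str: the LLM under evaluation.
    judge_hallucination(response, reference) -> bool: True if the
               response conflicts with the reference answer.
    """
    totals = defaultdict(int)
    hallucinated = defaultdict(int)
    for item in benchmark:
        response = generate_response(item["query"])
        totals[item["category"]] += 1
        if judge_hallucination(response, item["reference"]):
            hallucinated[item["category"]] += 1
    # Fraction of hallucinated responses per query category.
    return {cat: hallucinated[cat] / totals[cat] for cat in totals}
```

In practice, the judge step would typically be another strong LLM comparing the response against the reference answer, but any scoring function with the same interface fits this sketch.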
Key insights extracted from the source content by Zhiying Zhu et al., arxiv.org, 03-08-2024: https://arxiv.org/pdf/2403.04307.pdf