HaluEval-Wild introduces a benchmark to evaluate LLM hallucinations in real-world settings. It collects challenging user queries from datasets like ShareGPT, categorizes them into five types, and synthesizes reference answers using GPT-4 and RAG. The benchmark highlights the nuanced challenge of balancing model performance with reliability, especially in knowledge-distilled models. Various LLMs are evaluated on the benchmark, revealing differences in hallucination rates. The study emphasizes the importance of understanding and improving LLM reliability in dynamic user interactions.
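The summary describes a pipeline of collecting queries, grouping them into five categories, and scoring model answers against synthesized references. As a minimal sketch only (not the authors' code), the snippet below shows how a per-category hallucination rate could be tallied; the record fields and the `judge_is_hallucination` callable are hypothetical placeholders for whatever judging step compares a model answer against the GPT-4/RAG reference.

```python
from collections import defaultdict

def hallucination_rate_by_category(records, judge_is_hallucination):
    """Tally hallucination rates per query category.

    Assumptions (not from the paper): `records` is an iterable of dicts
    with keys 'query', 'category' (one of the benchmark's five query
    types), and 'model_answer'; `judge_is_hallucination` is a
    hypothetical callable returning True when the model answer
    contradicts the benchmark's reference answer.
    """
    totals = defaultdict(int)
    hallucinated = defaultdict(int)
    for rec in records:
        totals[rec["category"]] += 1
        if judge_is_hallucination(rec["query"], rec["model_answer"]):
            hallucinated[rec["category"]] += 1
    # Fraction of judged-hallucinated answers in each category.
    return {cat: hallucinated[cat] / totals[cat] for cat in totals}
```

Comparing these per-category rates across models is what surfaces the differences in hallucination behavior the summary refers to.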
Key insights from the paper by Zhiying Zhu et al., arxiv.org, 03-08-2024: https://arxiv.org/pdf/2403.04307.pdf