NoMIRACL introduces a dataset for evaluating LLM robustness in RAG across 18 languages. It measures hallucination and error rates using two subsets: non-relevant and relevant. Most LLMs struggle to balance both capacities, with GPT-4 showing the best tradeoff. Mistral provides explanations but has high error rates. Different LLMs exhibit various patterns in response generation.
Sang ngôn ngữ khác
từ nội dung nguồn
arxiv.org
Thông tin chi tiết chính được chắt lọc từ
by Nandan Thaku... lúc arxiv.org 03-05-2024
https://arxiv.org/pdf/2312.11361.pdfYêu cầu sâu hơn