Core Concepts
LLMs struggle to balance hallucination and error rates in multilingual retrieval-augmented generation.
Abstract
NoMIRACL introduces a dataset for evaluating LLM robustness in RAG across 18 languages. It measures two failure modes using two subsets: hallucination rate on a non-relevant subset (queries whose retrieved passages contain no answer) and error rate on a relevant subset (queries whose passages do contain the answer). Most LLMs struggle to balance the two, with GPT-4 showing the best tradeoff. Mistral provides explanations but has a high error rate, and different LLMs exhibit distinct response-generation patterns.
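As a rough sketch, the two metrics can be thought of as answer/refusal rates on each subset. The function names and the literal "I don't know" string matching below are illustrative assumptions, not NoMIRACL's actual evaluation code:

```python
# Illustrative sketch of the two NoMIRACL metrics (hypothetical helpers,
# not the benchmark's official evaluation script).

def hallucination_rate(responses):
    """Fraction of non-relevant-subset queries (no answer in the retrieved
    passages) where the model answers anyway instead of refusing."""
    answered = sum(1 for r in responses if r != "I don't know")
    return answered / len(responses)

def error_rate(responses):
    """Fraction of relevant-subset queries (answer present in the retrieved
    passages) where the model incorrectly refuses to answer."""
    refused = sum(1 for r in responses if r == "I don't know")
    return refused / len(responses)

# Toy example: model responses on each subset.
non_relevant = ["I don't know", "Paris", "I don't know", "42"]
relevant = ["I don't know", "Berlin", "Tokyo", "I don't know"]

print(hallucination_rate(non_relevant))  # 0.5
print(error_rate(relevant))              # 0.5
```

A robust model minimizes both rates at once: refusing on the non-relevant subset without over-refusing on the relevant one.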
Stats
Models like LLaMA-2, Orca-2, and FLAN-T5 exhibit high hallucination rates on the non-relevant subset.
Mistral's error rate on the relevant subset reaches up to 74.9%.
GPT-4 provides the best tradeoff across both subsets.