Key Concepts
Retrieval-augmented language models can be made more robust to irrelevant retrieved context through a combination of natural language inference-based filtering and fine-tuning on a mixture of relevant and irrelevant contexts.
Summary
The paper analyzes the robustness of retrieval-augmented language models (RALMs) to irrelevant retrieved context, and proposes two methods to improve their performance:
- Natural Language Inference (NLI) Filtering:
  - Uses an NLI model to identify irrelevant retrieved contexts and falls back to the base language model's output when the context is deemed irrelevant.
  - This approach is effective at preventing performance degradation due to irrelevant context, but it also discards some relevant contexts.
- Robust RALM Fine-tuning:
  - Automatically generates training data that includes a mixture of relevant and irrelevant retrieved contexts.
  - Fine-tuning the RALM on this data teaches the model to properly leverage relevant context while ignoring irrelevant context.
  - This approach outperforms both the base language model and the NLI-based filtering, especially when dealing with noisy or randomly retrieved contexts.
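The NLI filtering idea above can be illustrated with a minimal sketch. This is not the paper's implementation: the function names, the 0.5 threshold, and the choice to use the retrieved context as the premise and the question plus candidate answer as the hypothesis are assumptions made for illustration. `toy_entailment_prob` is a crude word-overlap stand-in so the example runs without a model; a real setup would plug in an actual NLI classifier.

```python
from typing import Callable


def toy_entailment_prob(premise: str, hypothesis: str) -> float:
    """Hypothetical stand-in for an NLI model: the fraction of hypothesis
    tokens that also occur in the premise. Not a real entailment score."""
    premise_tokens = set(premise.lower().split())
    hypothesis_tokens = hypothesis.lower().split()
    if not hypothesis_tokens:
        return 0.0
    hits = sum(token in premise_tokens for token in hypothesis_tokens)
    return hits / len(hypothesis_tokens)


def answer_with_nli_fallback(
    question: str,
    context: str,
    answer_with_context: Callable[[str, str], str],
    answer_without_context: Callable[[str], str],
    entailment_prob: Callable[[str, str], float],
    threshold: float = 0.5,  # assumed cutoff, not taken from the paper
) -> str:
    """Keep the retrieval-augmented answer only if the context is judged
    to support it; otherwise fall back to the base model's answer."""
    candidate = answer_with_context(question, context)
    # Assumed framing: context is the premise, question + answer the hypothesis.
    hypothesis = f"{question} {candidate}"
    if entailment_prob(context, hypothesis) >= threshold:
        return candidate
    # Context deemed irrelevant: ignore it and use the no-retrieval answer.
    return answer_without_context(question)
```

The fallback branch is what protects against harmful context, and it is also where the method loses accuracy: any relevant context that scores below the threshold is discarded along with the irrelevant ones.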
The paper evaluates the proposed methods on five open-domain question answering benchmarks, including both single-hop and multi-hop tasks. The results show that the fine-tuned RALM maintains high performance when using a strong retriever, while also being robust to irrelevant retrieved context.
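As a rough illustration of the fine-tuning data described above, the sketch below pairs each question with either its own retrieved context or a context retrieved for a different question, while always keeping the gold answer as the target. The names `TrainingExample` and `build_mixed_training_set`, the `distractor_fraction` parameter, and the 50/50 default are hypothetical; the paper's actual data generation procedure may differ.

```python
import random
from dataclasses import dataclass
from typing import List, Sequence, Tuple


@dataclass
class TrainingExample:
    question: str
    answer: str          # gold answer, kept regardless of the context used
    context: str
    is_distractor: bool  # True if the context came from another question


def build_mixed_training_set(
    qa_pairs: Sequence[Tuple[str, str]],
    retrieved: Sequence[str],
    distractor_fraction: float = 0.5,  # assumed mixing ratio
    seed: int = 0,
) -> List[TrainingExample]:
    """Mix own-retrieval contexts with contexts borrowed from other questions."""
    if len(qa_pairs) != len(retrieved):
        raise ValueError("need exactly one retrieved context per question")
    rng = random.Random(seed)
    examples = []
    for i, (question, answer) in enumerate(qa_pairs):
        use_distractor = len(qa_pairs) > 1 and rng.random() < distractor_fraction
        if use_distractor:
            # Borrow a context that was retrieved for a different question.
            j = rng.choice([k for k in range(len(qa_pairs)) if k != i])
            examples.append(TrainingExample(question, answer, retrieved[j], True))
        else:
            examples.append(TrainingExample(question, answer, retrieved[i], False))
    return examples
```

Because the target stays the gold answer in both cases, a model trained on such data has an incentive to use the context when it helps and to answer from its own knowledge when it does not. Note that a question's own retrieved context is not guaranteed to be relevant, so `is_distractor=False` only records where the context came from.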
Statistics
Retrieval augmentation can boost performance on some tasks, but even strong retrieval hurts performance on StrategyQA and Fermi.
Random retrieved contexts reduce performance dramatically across all datasets.
Errors caused by irrelevant context include copying irrelevant answers and hallucinating incorrect answers and decompositions.
Quotes
"Retrieval-augmented language models (RALMs) hold promise to produce language understanding systems that are factual, efficient, and up-to-date. An important desideratum of RALMs, is that retrieved information helps model performance when it is relevant, and does not harm performance when it is not."
"We empirically show that even 1,000 examples suffice to train the model to be robust to irrelevant contexts while maintaining high performance on examples with relevant ones."