Core Concepts
Retrieval-augmented open-domain question answering models face challenges in generalizing to updated knowledge corpora or unseen domains due to the reader's tendency to over-memorize retrieved contexts. Corpus-Invariant Tuning (CIT) is proposed to mitigate this issue by controlling the likelihood of retrieved documents during training, leading to improved generalization across different corpora and domains.
Abstract
The content discusses the generalization challenges faced by retrieval-augmented open-domain question answering (OpenQA) models: they struggle to adapt to updated versions of the same knowledge corpus and to perform well on entirely different knowledge domains.
The authors hypothesize that this issue stems from the reader module's tendency to over-memorize the knowledge retrieved from the external corpus during training, rather than relying on the retriever to fetch more relevant contexts. This over-memorization reduces the model's dependency on the retriever and hinders its ability to generalize to new information or domains.
To address this problem, the authors introduce Corpus-Invariant Tuning (CIT), a training strategy that aims to mitigate the reader's tendency to memorize the retrieved documents. CIT introduces an additional loss term that controls the likelihood of the retrieved documents during training, encouraging the reader to rely more on the retriever for relevant information.
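To make the idea of "controlling the likelihood of the retrieved documents" concrete, below is a minimal PyTorch sketch of what such a combined objective could look like. The paper's exact formulation is not given in this summary, so the function name `cit_loss`, the subtractive passage-likelihood penalty, the `lambda_cit` weight, and the label-masking scheme are all illustrative assumptions, not CIT's actual implementation.

```python
import torch
import torch.nn.functional as F

def cit_loss(reader_logits, answer_labels, passage_labels, lambda_cit=0.1):
    """Hypothetical sketch of a corpus-invariant training objective.

    reader_logits:  (batch, seq_len, vocab) token logits from the reader.
    answer_labels:  (batch, seq_len) gold answer tokens; -100 marks
                    positions excluded from the QA loss.
    passage_labels: (batch, seq_len) tokens of the retrieved passage;
                    -100 marks non-passage positions.
    lambda_cit:     illustrative weight on the anti-memorization term.
    """
    vocab = reader_logits.size(-1)
    flat_logits = reader_logits.view(-1, vocab)

    # Standard generative QA loss: maximize likelihood of the answer tokens.
    qa_loss = F.cross_entropy(
        flat_logits, answer_labels.view(-1), ignore_index=-100
    )

    # Negative log-likelihood the reader assigns to the retrieved passage
    # tokens. Keeping this likelihood from growing too high discourages the
    # reader from absorbing the corpus into its parameters.
    passage_nll = F.cross_entropy(
        flat_logits, passage_labels.view(-1), ignore_index=-100
    )

    # Subtracting the NLL penalizes high passage likelihood. A real
    # implementation would likely bound or target this term rather than
    # push the likelihood down without limit.
    return qa_loss - lambda_cit * passage_nll

# Illustrative shapes only: batch of 2, sequence of 16, vocabulary of 100.
logits = torch.randn(2, 16, 100, requires_grad=True)
answers = torch.randint(0, 100, (2, 16))
passages = torch.randint(0, 100, (2, 16))
cit_loss(logits, answers, passages).backward()
```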
Extensive experiments are conducted on multiple OpenQA benchmarks, including NaturalQuestions, TriviaQA, and RobustQA. The results demonstrate that models trained with the proposed CIT loss exhibit significantly improved generalization capabilities across different corpus versions and knowledge domains, without compromising their performance on the original corpus and domain.
Stats
The content does not contain any key metrics or important figures supporting the authors' main arguments.
Quotes
The content does not contain any striking quotes supporting the authors' main arguments.