Core Concepts
DUQGen, a novel unsupervised domain adaptation framework, generates diverse and representative synthetic training data to effectively fine-tune pre-trained neural rankers for target domains.
Abstract
The paper proposes DUQGen, a novel unsupervised domain adaptation framework for neural ranking models. The key innovations of DUQGen include:
- Representing the target document collection using document clustering to capture the domain's topical diversity.
- Diversifying the synthetic query generation by probabilistic sampling over the resulting document clusters.
- Prompting a large language model (LLM) with in-context examples to generate high-quality queries from the selected documents.
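The three steps above can be sketched in a few lines of Python. This is a minimal illustration under stated assumptions, not the authors' implementation: it assumes cluster labels and centroid distances were already computed (for example by K-Means over document embeddings). Clusters are drawn in proportion to their remaining size, and a document within the chosen cluster is drawn with a softmax over its negative distance to the centroid. The function names `sample_documents` and `build_prompt`, the `temperature` parameter, and the prompt template are all hypothetical.

```python
import math
import random
from collections import defaultdict


def sample_documents(cluster_ids, distances, n_samples, temperature=1.0, seed=0):
    """Pick up to n_samples distinct document indices, spread across clusters.

    cluster_ids[i]: cluster label of document i (assumed precomputed, e.g. by K-Means)
    distances[i]:   distance of document i to its cluster centroid
    The weighting scheme here is an assumption, not the paper's exact formula.
    """
    rng = random.Random(seed)
    remaining = defaultdict(list)
    for doc_idx, cid in enumerate(cluster_ids):
        remaining[cid].append(doc_idx)

    cluster_keys = sorted(remaining)
    n_samples = min(n_samples, len(cluster_ids))
    chosen = []
    while len(chosen) < n_samples:
        # Draw a cluster with probability proportional to its remaining size.
        sizes = [len(remaining[c]) for c in cluster_keys]
        cid = rng.choices(cluster_keys, weights=sizes, k=1)[0]
        candidates = remaining[cid]
        # Softmax over negative distance; subtracting the minimum keeps exp() stable.
        d_min = min(distances[d] for d in candidates)
        weights = [math.exp(-(distances[d] - d_min) / temperature) for d in candidates]
        doc = rng.choices(candidates, weights=weights, k=1)[0]
        candidates.remove(doc)  # sample without replacement
        chosen.append(doc)
    return chosen


def build_prompt(examples, document):
    """Assemble a few-shot prompt from (document, query) in-context examples."""
    parts = [f"Document: {ex_doc}\nQuery: {ex_query}\n" for ex_doc, ex_query in examples]
    parts.append(f"Document: {document}\nQuery:")
    return "\n".join(parts)


# Toy usage with made-up labels and distances for six documents.
picked = sample_documents([0, 0, 0, 1, 1, 2], [0.1, 0.4, 0.9, 0.2, 0.3, 0.5], n_samples=4)
prompt = build_prompt([("Aspirin reduces fever.", "what does aspirin treat")],
                      "K-Means partitions points into k groups.")
```

The prompt string would then be sent to an LLM of choice, and each generated query paired with its source document becomes one synthetic training example.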
The authors run extensive experiments on the BEIR benchmark and show that DUQGen outperforms all state-of-the-art (SOTA) baselines, including zero-shot neural rankers and other unsupervised domain adaptation methods, on 16 of 18 datasets, with an average relative improvement of 4% across all datasets.
The paper also provides a thorough analysis of the components of DUQGen, highlighting the importance of document clustering and diverse query generation for effective domain adaptation. The results show that DUQGen can achieve these improvements using only a few thousand synthetic training examples, significantly reducing the required training data compared to previous methods.
Stats
The target document collection is divided into 1000 clusters using K-Means clustering.
The optimal number of synthetic training examples is reported as 1000 for ColBERT and 1000/5000 for MonoT5-3B.
DUQGen improves on the zero-shot performance of SOTA neural rankers by an average of 4% (relative) across all BEIR datasets.
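As a reminder of what a relative improvement measures, the sketch below computes it per dataset and then averages. The scores are made-up placeholders, not results from the paper, and averaging the per-dataset percentages is an assumption about how the 4% figure is aggregated.

```python
def relative_improvement(baseline, adapted):
    """Percentage gain of the adapted score over the baseline score."""
    return (adapted - baseline) / baseline * 100.0


# Hypothetical (zero-shot, fine-tuned) scores for two placeholder datasets.
scores = {"dataset_a": (0.50, 0.52), "dataset_b": (0.40, 0.42)}

per_dataset = {name: relative_improvement(base, new) for name, (base, new) in scores.items()}
average_gain = sum(per_dataset.values()) / len(per_dataset)
```

Note that the same absolute gain of 0.02 yields a larger relative improvement on the dataset with the lower baseline.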
Quotes
"DUQGen, a general and effective unsupervised approach for domain adaption of neural rankers via synthetic query generation for training."
"A novel and general method for creating representative and diverse synthetic query data for a given collection via clustering and probabilistic sampling."
"Comprehensive experiments demonstrating that DUQGen consistently outperforms all SOTA baselines on 16 out of 18 BEIR datasets, and thorough analysis of the components of DUQGen responsible for the improvements."