Core Concepts
Combining query generation with self-supervised pseudo-relevance labeling improves domain adaptation for dense retrieval and conversational search models.
Abstract
Recent studies have shown that dense retrieval models generalize to target domains less effectively than interaction-based models. This paper proposes a method that combines query generation with self-supervision via pseudo-relevance labeling to address this challenge. By using a T5-3B re-ranker to label pseudo-positives over BM25 candidates and carefully mining hard negatives, the approach enables domain adaptation with real queries and documents from the target dataset. Experiments demonstrate improvements over baseline models when they are fine-tuned on the pseudo-relevance labeled data. The approach is further extended to conversational dense retrieval models by incorporating a query-rewriting module. Several negative sampling strategies are explored, with SimANS hard negative sampling consistently outperforming the others. The proposed approach achieves state-of-the-art results in both dense retrieval and conversational search tasks.
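For illustration, a minimal sketch of the pseudo-positive labeling step described above: BM25 candidates for a target-domain query are re-scored with a T5 re-ranker, and the top-ranked document is kept as the pseudo-positive. The MonoT5-style prompt, the castorini/monot5-base-msmarco checkpoint, and the rerank helper are assumptions made for this sketch; the paper's labeler is a T5-3B re-ranker.

```python
import torch
from transformers import AutoTokenizer, T5ForConditionalGeneration

# Assumed checkpoint for illustration: a MonoT5-style re-ranker.
# The paper uses a T5-3B re-ranker; swap in that checkpoint for real labeling.
tokenizer = AutoTokenizer.from_pretrained("t5-base")
model = T5ForConditionalGeneration.from_pretrained("castorini/monot5-base-msmarco").eval()

TRUE_ID = tokenizer.encode("true")[0]    # token scored as "relevant"
FALSE_ID = tokenizer.encode("false")[0]  # token scored as "not relevant"

@torch.no_grad()
def rerank(query, candidates):
    """Re-score BM25 candidates with the T5 re-ranker (MonoT5-style prompt)."""
    scores = []
    start = torch.tensor([[model.config.decoder_start_token_id]])
    for doc in candidates:
        prompt = f"Query: {query} Document: {doc} Relevant:"
        inputs = tokenizer(prompt, return_tensors="pt", truncation=True, max_length=512)
        logits = model(**inputs, decoder_input_ids=start).logits[0, -1]
        # Relevance score = log-probability of "true" vs. "false".
        scores.append(torch.log_softmax(logits[[FALSE_ID, TRUE_ID]], dim=-1)[1].item())
    return sorted(zip(candidates, scores), key=lambda p: p[1], reverse=True)

# The top re-ranked candidate becomes the pseudo-positive for this query.
bm25_candidates = ["candidate passage A ...", "candidate passage B ...", "candidate passage C ..."]
pseudo_positive = rerank("how do dense retrievers adapt to new domains", bm25_candidates)[0][0]
```

The resulting (query, pseudo-positive) pairs, together with mined hard negatives, form the training data used to fine-tune the dense retriever on the target domain.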
Stats
Pseudo-positives obtained by re-ranking BM25 candidates with T5-3B (BM25+T5-3B top positives) improve DR models' generalization ability.
SimANS hard negative sampling consistently outperforms the other strategies (see the sampling sketch after this list).
DoDress-T53B models show significant improvements over baselines.
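For illustration, a minimal sketch of SimANS-style ambiguous negative sampling: candidates whose retriever scores fall close to the pseudo-positive's score are sampled with higher probability, avoiding both trivially easy negatives and likely false negatives. The function name, the a and b hyperparameters, and the example scores are illustrative assumptions, not the paper's exact settings.

```python
import numpy as np

def simans_sample(neg_scores, pos_score, k=8, a=0.5, b=0.0, seed=0):
    """SimANS-style sampling: prefer negatives scored close to the positive.

    p_i is proportional to exp(-a * (s_i - s_pos - b)^2), so ambiguous candidates
    near the positive's score are picked most often. a and b are tunable
    hyperparameters; the defaults here are illustrative.
    """
    rng = np.random.default_rng(seed)
    neg_scores = np.asarray(neg_scores, dtype=float)
    weights = np.exp(-a * (neg_scores - pos_score - b) ** 2)
    probs = weights / weights.sum()
    k = min(k, len(neg_scores))
    return rng.choice(len(neg_scores), size=k, replace=False, p=probs)

# Example: retriever scores of candidate negatives vs. a pseudo-positive scored 0.88.
negative_indices = simans_sample([0.91, 0.85, 0.60, 0.40, 0.10], pos_score=0.88, k=2)
```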
Quotes
"DoDress-T53B (GPL) shows an 8.6% improvement over GPL."
"The proposed pseudo-relevance labeling approach helps dense retrieval models generalize to new domains."
"SimANS hard negative sampling consistently performs the best on all datasets."