The paper proposes a novel unsupervised text representation learning approach for dual-encoder retrieval models. The key ideas are:
Instruction-tuning a pre-trained encoder-decoder large language model (LLM) on unlabeled corpus data together with synthetic queries generated by following defined instructions (e.g., question generation, keyword summarization).
Leveraging the Rao-Blackwell theorem to augment the corpus representation by combining the original corpus embedding with the embeddings of the relevant synthetic queries generated by the instruction-tuned LLM (see the sketch after this list).
The instruction-tuning process aligns the query and corpus text representations, directly optimizing the retrieval model during training.
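To make the embedding augmentation concrete, here is a minimal NumPy sketch of combining a document embedding with the mean of its synthetic-query embeddings. The weighting factor `alpha`, the simple mean, and the re-normalization step are illustrative assumptions for this sketch, not the paper's exact formulation.

```python
import numpy as np

def augmented_doc_embedding(doc_emb: np.ndarray,
                            synth_query_embs: np.ndarray,
                            alpha: float = 0.5) -> np.ndarray:
    """Blend a document embedding with the mean embedding of its synthetic
    queries, then re-normalize so dot-product scoring stays comparable.
    `alpha` and the plain mean are assumed choices for illustration."""
    query_mean = synth_query_embs.mean(axis=0)
    combined = alpha * doc_emb + (1.0 - alpha) * query_mean
    return combined / np.linalg.norm(combined)

# Toy usage: random unit vectors stand in for encoder outputs.
rng = np.random.default_rng(0)
doc = rng.normal(size=768)
doc /= np.linalg.norm(doc)
queries = rng.normal(size=(3, 768))  # e.g., a generated question plus keyword summaries
queries /= np.linalg.norm(queries, axis=1, keepdims=True)
print(augmented_doc_embedding(doc, queries).shape)  # (768,)
```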
The authors evaluate their approach on three English and one German information retrieval dataset, measuring NDCG@10, MRR@100, and Recall@100. They report significant gains in zero-shot average retrieval performance, exceeding three competitive supervised dense retrievers by 1.96%, 4.62%, and 9.52% absolute on NDCG@10, with a model size at least 38% smaller.
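For reference, a small sketch of the headline metric, NDCG@10, using the common exponential-gain formulation; whether the evaluation uses exponential or linear gain is not stated in this summary, so the exact variant is an assumption.

```python
import numpy as np

def ndcg_at_k(relevances, k: int = 10) -> float:
    """NDCG@k for one query. `relevances` are graded labels in the order
    the system ranked the documents (exponential-gain form assumed)."""
    rel = np.asarray(relevances, dtype=float)
    discounts = 1.0 / np.log2(np.arange(2, min(k, rel.size) + 2))
    dcg = ((2.0 ** rel[:k] - 1.0) * discounts).sum()
    ideal = np.sort(rel)[::-1]
    idcg = ((2.0 ** ideal[:k] - 1.0) * discounts).sum()
    return float(dcg / idcg) if idcg > 0 else 0.0

# Example: a ranked list with graded relevance labels 0-2.
print(round(ndcg_at_k([2, 0, 1, 0, 0, 1, 0, 0, 0, 0]), 4))
```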
Key Insights Distilled From
by Qiuhai Zeng et al., arxiv.org, 09-26-2024
https://arxiv.org/pdf/2409.16497.pdf