Key Concepts
An unsupervised text representation learning approach, based on self-instructed tuning of a pre-trained encoder-decoder language model, can effectively augment the corpus representation for zero-shot dense retrieval.
Summary
The paper proposes a novel unsupervised text representation learning approach for dual-encoder retrieval models. The key ideas are:
- Instruction-tuning a pre-trained encoder-decoder language model (LLM) on unlabeled corpus data and synthetic queries generated by following defined instructions (e.g., question generation, keyword summarization).
- Leveraging the Rao-Blackwell theorem to augment the corpus representation by combining the original corpus embedding with the embeddings of the relevant synthetic queries produced by the instruction-tuned model (see the sketch after this list).
- Aligning the query and corpus text representations through the instruction-tuning process, which directly optimizes the retrieval model during training.
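A minimal sketch of the embedding-level augmentation described in the second bullet, assuming, purely for illustration, that the augmented document embedding is the uniform average of the original embedding and its k synthetic-query embeddings (the paper's exact weighting may differ):

```python
import numpy as np

def rao_blackwell_augment(doc_embedding: np.ndarray,
                          query_embeddings: np.ndarray) -> np.ndarray:
    """Average a document embedding with its synthetic-query embeddings.

    doc_embedding: shape (d,); query_embeddings: shape (k, d).
    The uniform weighting is an assumption made for this sketch.
    """
    # Stack the original embedding with the k query embeddings and average.
    stacked = np.vstack([doc_embedding[None, :], query_embeddings])
    augmented = stacked.mean(axis=0)
    # Re-normalize so dot-product scores stay comparable across documents.
    return augmented / np.linalg.norm(augmented)

# Example: one 4-dim document embedding and two synthetic-query embeddings.
doc = np.array([1.0, 0.0, 0.0, 0.0])
queries = np.array([[0.0, 1.0, 0.0, 0.0],
                    [0.0, 0.0, 1.0, 0.0]])
print(rao_blackwell_augment(doc, queries))
```

Averaging pulls the document representation toward the region of the vector space occupied by its relevant queries, which is the intuition quoted later in this summary.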
The authors evaluate their approach on four information retrieval datasets (three English, one German), measuring NDCG@10, MRR@100, and Recall@100. They report significant improvements in zero-shot average retrieval performance, exceeding three competitive supervised dense retrievers by 1.96%, 4.62%, and 9.52% absolute on NDCG@10, with a model size at least 38% smaller.
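For reference, NDCG@10 per its conventional definition with linear gains (some evaluation toolkits use exponential gains instead; this is a generic sketch, not the paper's evaluation code):

```python
import numpy as np

def ndcg_at_10(relevances: list[float]) -> float:
    """NDCG@10 for a single query.

    relevances: graded relevance labels of the retrieved documents,
    in the order the system ranked them.
    """
    rels = np.asarray(relevances[:10], dtype=float)
    discounts = 1.0 / np.log2(np.arange(2, len(rels) + 2))  # 1 / log2(rank + 1)
    dcg = float((rels * discounts).sum())
    ideal = np.sort(rels)[::-1]                  # best possible ordering
    idcg = float((ideal * discounts).sum())
    return dcg / idcg if idcg > 0 else 0.0

print(ndcg_at_10([3, 2, 3, 0, 1, 2]))  # ~0.96 for this toy ranking
```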
Quotes
"Dense retrieval systems commonly employ dual-encoder retrieval models which use two separate encoders, either symmetric or asymmetric, to represent the query and corpus."
"Different from the previous work, we demonstrate directly on the embedding level instead of the text level, that the synthetically generated queries' embeddings can effectively augment the corpus representation."
"We expect to achieve better performance with this formula for corpus representation. An intuitive understanding is that it gets closer to the relevant queries' embedding in the vector space."