The paper introduces Seismic, a novel approximate nearest neighbor (ANN) algorithm for efficient retrieval over learned sparse representations (LSR). LSR models like Splade and Efficient Splade encode text into sparse embeddings, where each dimension corresponds to a vocabulary term. While LSR offers advantages like interpretability and compatibility with inverted indexes, efficient retrieval remains challenging due to the statistical properties of the learned embeddings.
Seismic makes three key contributions:
1. It observes that LSR models concentrate the majority of a vector's L1 mass on a small subset of its dimensions. This "concentration of importance" property lets Seismic approximate the inner product between a query and a document using only the top entries.
2. Seismic organizes the inverted index into geometrically cohesive blocks, each equipped with a summary vector. During query processing, Seismic quickly decides which blocks merit evaluation by comparing the query's inner product with the block summaries.
3. Seismic leverages a forward index to compute exact inner products for the documents in selected blocks, correcting the approximation errors introduced by the block summaries.
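The interplay of the three ideas above can be sketched as a two-stage scoring loop. This is a minimal illustration, not the authors' implementation: the block layout, the summary construction, and parameter names such as `query_cut` and `min_summary_score` are assumptions made for the example.

```python
# Sketch of Seismic-style query processing (illustrative, not the paper's code).
# Sparse vectors are represented as {term_id: weight} dicts.

def dot(a, b):
    # Sparse inner product; iterate over the smaller dict.
    if len(a) > len(b):
        a, b = b, a
    return sum(w * b.get(t, 0.0) for t, w in a.items())

def truncate(vec, k):
    # Keep only the k largest-magnitude entries, exploiting the
    # "concentration of importance" property of LSR vectors.
    top = sorted(vec.items(), key=lambda kv: -abs(kv[1]))[:k]
    return dict(top)

def search(query, blocks, forward_index, query_cut=10,
           min_summary_score=0.0, top_k=10):
    q = truncate(query, query_cut)
    scored = []
    for block in blocks:
        # Stage 1: cheap test against the block's summary vector;
        # skip blocks whose summary scores too low.
        if dot(q, block["summary"]) <= min_summary_score:
            continue
        # Stage 2: exact inner products via the forward index.
        for doc_id in block["doc_ids"]:
            scored.append((dot(q, forward_index[doc_id]), doc_id))
    scored.sort(reverse=True)
    return scored[:top_k]
```

A summary vector here could be, for instance, the coordinate-wise maximum over the documents in a block, so that its inner product with the query upper-bounds every document score in that block; documents in blocks that are pruned in stage 1 are never scored at all.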
Experimental results on the MS MARCO and Natural Questions datasets show that Seismic outperforms state-of-the-art baselines, including the winning submissions to the 2023 BigANN Challenge, by a significant margin in query latency while maintaining high retrieval accuracy. Seismic reaches sub-millisecond per-query latency, often one to two orders of magnitude faster than the baselines.
Key insights distilled from content by Sebastian Br... on arxiv.org, 04-30-2024
https://arxiv.org/pdf/2404.18812.pdf