Core Concepts
Seismic, a novel approximate nearest neighbor algorithm, enables efficient and effective retrieval over learned sparse embeddings by leveraging the concentration of importance property in these representations.
Abstract
The paper introduces Seismic, a novel approximate nearest neighbor (ANN) algorithm for efficient retrieval over learned sparse representations (LSR). LSR models like Splade and Efficient Splade encode text into sparse embeddings, where each dimension corresponds to a vocabulary term. While LSR offers advantages like interpretability and compatibility with inverted indexes, efficient retrieval remains challenging due to the statistical properties of the learned embeddings.
Seismic makes three key contributions:
It observes that LSR models tend to concentrate most of a vector's L1 mass on a small subset of dimensions. This "concentration of importance" property allows Seismic to approximate the inner product between a query and a document using only their largest entries.
Seismic organizes the inverted index into geometrically cohesive blocks, each paired with a summary vector. During query processing, Seismic quickly decides which blocks are worth evaluating by computing the query's inner product with each block summary.
Seismic leverages a forward index to compute exact inner products for the documents in the selected blocks, correcting the approximation error introduced by the block summaries (a minimal sketch of this query-processing flow follows below).
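The following is a minimal Python sketch of these three ideas, assuming sparse vectors are represented as term-to-weight dicts with nonnegative weights. The truncation depths, the fixed-size posting-list chunks (the paper instead forms geometrically cohesive blocks by clustering), the coordinate-wise-max summaries, and the heap_factor pruning knob are illustrative assumptions, not the authors' exact design.

```python
# Minimal sketch only: sparse vectors are dicts {term_id: weight} with
# nonnegative weights. Block formation (fixed-size chunks instead of the
# paper's clustering), the coordinate-wise-max summaries, and the
# heap_factor knob are illustrative assumptions, not Seismic's exact design.
from collections import defaultdict
import heapq


def truncate(vec, k):
    """Keep only the k largest-weight entries (concentration of importance)."""
    return dict(sorted(vec.items(), key=lambda kv: -kv[1])[:k])


def dot(a, b):
    """Inner product of two sparse vectors."""
    if len(a) > len(b):
        a, b = b, a
    return sum(w * b.get(t, 0.0) for t, w in a.items())


def build_index(docs, doc_cut=50, block_size=64):
    """Forward index (full vectors) plus an inverted index whose posting
    lists are split into blocks, each carrying a summary vector."""
    forward = list(docs)
    postings = defaultdict(list)                 # term -> [doc_id, ...]
    for doc_id, vec in enumerate(forward):
        for term in truncate(vec, doc_cut):      # index only the top entries
            postings[term].append(doc_id)

    inverted = defaultdict(list)                 # term -> [(summary, block), ...]
    for term, doc_ids in postings.items():
        for i in range(0, len(doc_ids), block_size):
            block = doc_ids[i:i + block_size]
            summary = {}                         # coordinate-wise max over the block
            for d in block:
                for t, w in forward[d].items():
                    summary[t] = max(summary.get(t, 0.0), w)
            inverted[term].append((summary, block))
    return forward, inverted


def search(query, forward, inverted, k=10, query_cut=10, heap_factor=1.0):
    """Score block summaries to decide which blocks to visit, then rescore
    the visited documents exactly with the forward index."""
    q = truncate(query, query_cut)
    heap, seen = [], set()                       # min-heap of (score, doc_id)
    for term in q:
        for summary, block in inverted.get(term, []):
            bound = dot(q, summary)              # upper bound on any score in the block
            if len(heap) == k and bound * heap_factor <= heap[0][0]:
                continue                         # block cannot improve the top-k
            for d in block:
                if d in seen:
                    continue
                seen.add(d)
                score = dot(q, forward[d])       # exact inner product
                if len(heap) < k:
                    heapq.heappush(heap, (score, d))
                elif score > heap[0][0]:
                    heapq.heapreplace(heap, (score, d))
    return sorted(heap, reverse=True)
```

Relative to the truncated query, the coordinate-wise-max summary upper-bounds every document score in its block, so with heap_factor set to 1 a block is skipped only when it provably cannot improve the current top-k; values below 1 prune more aggressively and trade some accuracy for speed.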
Experimental results on the MS MARCO and Natural Questions datasets show that Seismic outperforms state-of-the-art baselines, including the winning submissions to the 2023 BigANN Challenge, by a significant margin in query latency while maintaining high retrieval accuracy. Seismic reaches sub-millisecond per-query latency, often one to two orders of magnitude faster than the baselines.
Stats
The top 10 entries of query vectors in the MS MARCO dataset account for 75% of the L1 mass on average.
The top 50 entries of document vectors in the MS MARCO dataset account for 75% of the L1 mass on average.
Keeping the top 10% of query entries and 20% of document entries preserves 85% of the full inner product on average.
Keeping the top 12% of query entries and 25% of document entries preserves 90% of the full inner product on average (the sketch below shows how such statistics can be computed).
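Statistics like these can be measured directly on a collection of sparse vectors. The following is a small sketch under the same dict-based representation assumed above; the quoted figures are the paper's, not something this snippet reproduces.

```python
# Small sketch, assuming sparse vectors are dicts {term_id: weight}; it shows
# how such statistics could be computed, not a reproduction of the paper's numbers.
def l1_mass_in_top_k(vec, k):
    """Fraction of a vector's L1 mass captured by its k largest entries."""
    weights = sorted((abs(w) for w in vec.values()), reverse=True)
    total = sum(weights)
    return sum(weights[:k]) / total if total else 0.0


def preserved_inner_product(query, doc, q_frac, d_frac):
    """Fraction of the full inner product kept after pruning both vectors
    to their top q_frac / d_frac of entries."""
    def top_frac(vec, frac):
        k = max(1, round(frac * len(vec)))
        return dict(sorted(vec.items(), key=lambda kv: -kv[1])[:k])

    def dot(a, b):
        return sum(w * b.get(t, 0.0) for t, w in a.items())

    full = dot(query, doc)
    return dot(top_frac(query, q_frac), top_frac(doc, d_frac)) / full if full else 0.0
```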
Quotes
"Learned sparse representations form an attractive class of contextual embeddings for text retrieval. That is so because they are effective models of relevance and are interpretable by design."
"Despite their apparent compatibility with inverted indexes, however, retrieval over sparse embeddings remains challenging. That is due to the distributional differences between learned embeddings and term frequency-based lexical models of relevance such as BM25."