Centrala begrepp
Learning a ranking function for routing queries to the most relevant partitions can consistently improve the accuracy of clustering-based approximate nearest neighbor search.
Sammanfattning
The paper proposes a learning-to-rank (LTR) approach to improve the routing function in clustering-based approximate nearest neighbor search (ANN). In this method, the routing function, which maps a query to a set of relevant partitions, is learned using supervised LTR techniques instead of using a predefined function.
The key insights are:
- The routing function in clustering-based ANN can be formulated as a ranking problem, where the goal is to rank the partitions by their likelihood of containing the nearest neighbor to the query.
- The ground-truth for training the ranking function can be easily obtained by performing an exact search for each query and identifying the partition that contains the nearest neighbor.
- The cross-entropy loss, which is a consistent surrogate for maximizing mean reciprocal rank (MRR), can be used as the training objective.
The authors demonstrate through experiments on various text datasets that learning a simple linear routing function can consistently improve the top-1 and top-k accuracy of clustering-based ANN compared to the baseline routing function. The gains are observed across different clustering algorithms, including standard, spherical, and shallow K-means.
The results suggest that bridging the fields of LTR and ANN can lead to significant performance improvements in practical applications. The authors also discuss potential future directions, such as incorporating LTR into the clustering algorithm itself to enable query-aware partitioning.
Statistik
The top-1 accuracy of the learnt routing function is 0.746, 0.751, and 0.670 on the Ms Marco dataset, compared to 0.392, 0.627, and 0.517 for the baseline, when using 0.1% of the total number of partitions.
The top-10 accuracy of the learnt routing function is consistently higher than the baseline across different values of ℓ (the number of partitions probed) on all three datasets.
Citat
"A critical piece of the modern information retrieval puzzle is approximate nearest neighbor search."
"We make a simple observation: The routing function solves a ranking problem. Its quality can therefore be assessed with a ranking metric, making the function amenable to learning-to-rank."
"As we demonstrate empirically on various datasets, learning a simple linear function consistently improves the accuracy of clustering-based MIPS."