Core Concepts
RAGCache, a novel multilevel dynamic caching system, efficiently caches the key-value tensors of retrieved documents to minimize redundant computation in Retrieval-Augmented Generation (RAG) systems.
Abstract
The paper presents RAGCache, a novel multilevel dynamic caching system tailored for Retrieval-Augmented Generation (RAG) systems. RAG combines the strengths of large language models (LLMs) and external knowledge databases to enhance generation quality. However, RAG injects long retrieved documents into the prompt, which leads to high computation and memory costs.
RAGCache addresses this challenge by caching the key-value tensors of retrieved documents across multiple requests to minimize redundant computation. The core of RAGCache is a knowledge tree with a prefix-aware Greedy Dual-Size Frequency (PGDSF) replacement policy that ensures caching the most critical key-value tensors. RAGCache also implements a global RAG controller that orchestrates interactions between the external knowledge database and LLM inference engine, with optimizations including cache-aware reordering and dynamic speculative pipelining.
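The document-level KV caching idea can be sketched as a prefix tree keyed by the ordered sequence of retrieved document IDs: a request reuses cached key-value tensors only for the longest matching prefix, which preserves the order sensitivity of LLM attention. The class and method names below are illustrative, not from the paper.

```python
# Hypothetical sketch of a knowledge tree for document-level KV caching.
# Keys are ordered document IDs; a node's KV tensors are valid only for
# that exact prefix of documents (LLM attention is order-sensitive).

class KnowledgeTreeNode:
    def __init__(self, doc_id=None):
        self.doc_id = doc_id      # document this node represents
        self.kv_tensors = None    # cached key-value tensors (placeholder)
        self.children = {}        # doc_id -> child node

class KnowledgeTree:
    """Maps an ordered document sequence to its longest cached KV prefix."""

    def __init__(self):
        self.root = KnowledgeTreeNode()

    def insert(self, doc_ids, kv_tensors):
        node = self.root
        for i, doc_id in enumerate(doc_ids):
            node = node.children.setdefault(doc_id, KnowledgeTreeNode(doc_id))
            node.kv_tensors = kv_tensors[: i + 1]  # KV for this exact prefix

    def longest_cached_prefix(self, doc_ids):
        """Return how many leading documents already have cached KV tensors."""
        node, hit = self.root, 0
        for doc_id in doc_ids:
            child = node.children.get(doc_id)
            if child is None or child.kv_tensors is None:
                break
            node, hit = child, hit + 1
        return hit
```

For example, after caching the sequence `["d1", "d2"]`, a request retrieving `["d1", "d2", "d3"]` gets a 2-document prefix hit, while `["d2", "d1"]` gets none, since the same documents in a different order produce different key-value tensors.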
The key highlights of the RAGCache design are:
- Knowledge Tree: Organizes the key-value tensors of retrieved documents in a prefix tree structure to enable efficient retrieval while maintaining the order sensitivity of LLMs.
- PGDSF Replacement Policy: Considers the document order, size, frequency, and recency to minimize the cache miss rate.
- Cache-aware Reordering: Prioritizes requests with larger cached contexts and lower recomputation cost to improve cache efficiency.
- Dynamic Speculative Pipelining: Overlaps the knowledge retrieval and LLM inference steps to minimize the end-to-end latency while keeping the system load under control.
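The replacement policy above builds on classical Greedy Dual-Size Frequency (GDSF) eviction, where an entry's priority grows with access frequency and recomputation cost and shrinks with size, plus an aging "clock" that captures recency. The sketch below shows only this classical core; the paper's prefix-aware variant additionally accounts for a document's position in the retrieval order, and all names here are hypothetical.

```python
# Minimal sketch of a GDSF eviction loop (the classical policy that the
# paper's PGDSF extends with prefix/order awareness). Priorities favor
# frequently used, costly-to-recompute, small entries; a rising clock
# value ages out entries that have not been touched recently.

class GDSFCache:
    def __init__(self, capacity_tokens):
        self.capacity = capacity_tokens
        self.clock = 0.0    # aging term: rises to each evicted priority
        self.entries = {}   # key -> (freq, size, cost, priority)

    def _priority(self, freq, size, cost):
        # Classical GDSF priority: clock + frequency * cost / size.
        return self.clock + freq * cost / size

    def access(self, key, size, cost):
        freq = self.entries[key][0] + 1 if key in self.entries else 1
        self._evict_until_fits(0 if key in self.entries else size)
        self.entries[key] = (freq, size, cost, self._priority(freq, size, cost))

    def _evict_until_fits(self, incoming):
        used = sum(e[1] for e in self.entries.values())
        while self.entries and used + incoming > self.capacity:
            # Evict the lowest-priority entry and advance the clock to it,
            # so future insertions outrank long-idle survivors.
            victim = min(self.entries, key=lambda k: self.entries[k][3])
            self.clock = self.entries[victim][3]
            used -= self.entries[victim][1]
            del self.entries[victim]
```

With a 10-token budget, caching a cheap 6-token entry and then a costly 6-token entry evicts the cheap one first, since its frequency-times-cost-over-size priority is lower.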
The experimental results show that RAGCache outperforms state-of-the-art solutions, reducing time to first token (TTFT) by up to 4x and improving throughput by up to 2.1x.
Stats
The average document length is 3,717.52 tokens, which is significantly longer than the average request length of 348.04 tokens.
The prefill latency with cached prefix is up to 11.5x lower than the full prefill latency.
The cache hit latency is up to 3.9x lower than the full prefill latency.
Quotes
"RAGCache caches the key-value tensors of retrieved documents across multiple requests to minimize redundant computation."
"The core of RAGCache is a knowledge tree with a prefix-aware Greedy Dual-Size Frequency (PGDSF) replacement policy that ensures caching the most critical key-value tensors."
"RAGCache also implements a global RAG controller that orchestrates interactions between the external knowledge database and LLM inference engine, with optimizations including cache-aware reordering and dynamic speculative pipelining."