Basic Concepts
Prompt Cache is an approach that accelerates inference for large language models (LLMs) by reusing attention states across different prompts, enabled by a modular, positionally coherent prompt structure.
Summary
The key insights and highlights are:
Many input prompts served to LLMs share overlapping text segments, such as system messages, prompt templates, and context documents. This overlap presents an opportunity to reuse attention states across prompts.
Prompt Cache introduces a novel technique to enable modular attention state reuse. It uses a Prompt Markup Language (PML) to explicitly define reusable text segments called "prompt modules" in a schema.
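For illustration, here is a minimal sketch of what a PML schema and a prompt importing its modules might look like. The tag and attribute names, the schema name, and the module contents are assumptions for illustration, not the paper's verbatim syntax:

```python
# A minimal sketch of a PML schema and a prompt that imports its modules.
# Tag names, attributes, and contents are illustrative assumptions.
import xml.etree.ElementTree as ET

SCHEMA = """
<schema name="travel-assistant">
  <module name="system-msg">You are a helpful travel planner.</module>
  <module name="city-guide">Paris: top sights include the Louvre.</module>
</schema>
"""

PROMPT = """
<prompt schema="travel-assistant">
  <system-msg/>
  <city-guide/>
  Suggest a one-day itinerary for a first-time visitor.
</prompt>
"""

# The reusable segments declared in the schema are exactly the units whose
# attention states can be precomputed once and shared across prompts.
modules = {m.get("name"): (m.text or "").strip()
           for m in ET.fromstring(SCHEMA).iter("module")}
print(sorted(modules))  # ['city-guide', 'system-msg']
```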
Prompt Cache precomputes and stores the attention states for these prompt modules. When a prompt is served, it retrieves the cached attention states for the imported prompt modules and computes the attention states only for the uncached segments.
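A minimal sketch of that reuse path follows. The `kv_states` function here is a hypothetical stand-in for a real transformer forward pass (which would return per-layer key/value tensors); the module texts and position offsets are likewise assumptions:

```python
# A minimal sketch of modular attention-state reuse. kv_states() stands in
# for a real transformer forward pass; here it just fabricates arrays of
# a plausible shape so the control flow is runnable.
import numpy as np

D = 64  # stand-in per-token state size

def kv_states(text: str, start_pos: int) -> np.ndarray:
    """Fake per-token KV states; real ones come from the model's layers."""
    tokens = text.split()
    rng = np.random.default_rng(abs(hash((text, start_pos))) % 2**32)
    return rng.standard_normal((len(tokens), D))

# Offline: precompute and cache attention states for each prompt module,
# each at its own fixed position offset (see the position-ID sketch below).
module_texts = {"system-msg": "You are a helpful travel planner .",
                "city-guide": "Paris : top sights include the Louvre ."}
offsets = {"system-msg": 0, "city-guide": 100}
kv_cache = {name: kv_states(text, offsets[name])
            for name, text in module_texts.items()}

# Online: a served prompt imports both modules plus a new suffix. Cached
# states are fetched; only the uncached suffix is actually computed.
suffix = "Suggest a one-day itinerary ."
suffix_offset = 200  # positions after all imported modules
prompt_kv = np.concatenate([kv_cache["system-msg"],
                            kv_cache["city-guide"],
                            kv_states(suffix, suffix_offset)])
print(prompt_kv.shape)  # (total_tokens, D)
```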
Prompt Cache tackles two key challenges: 1) position-dependence of attention states, and 2) efficient recognition of cached text segments. It solves these by assigning unique position IDs to prompt modules and leveraging the transformer's ability to operate on attention states with discontinuous position IDs.
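A sketch of the position-ID scheme, under the assumption that each module is assigned a fixed ID range when the schema is defined (the specific ranges and lengths here are illustrative):

```python
# A minimal sketch of schema-assigned position-ID ranges. When a prompt
# imports only some modules, their cached states keep their original IDs,
# so the resulting ID sequence is discontinuous.
module_ranges = {            # assigned once, when the schema is defined
    "system-msg": range(0, 8),      # positions 0..7
    "doc-a":      range(8, 58),     # positions 8..57
    "doc-b":      range(58, 108),   # positions 58..107
}

def position_ids(imported, suffix_len):
    """Concatenate each imported module's fixed range, then place the
    new (uncached) suffix after the largest ID used by the schema."""
    ids = [p for name in imported for p in module_ranges[name]]
    tail = max(r.stop for r in module_ranges.values())
    ids += list(range(tail, tail + suffix_len))
    return ids

# A prompt that skips "doc-a": IDs jump from 7 straight to 58.
print(position_ids(["system-msg", "doc-b"], suffix_len=4)[:12])
# [0, 1, 2, 3, 4, 5, 6, 7, 58, 59, 60, 61]
```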
Evaluations on benchmark datasets show that Prompt Cache can reduce time-to-first-token (TTFT) latency by up to 8x on GPUs and 60x on CPUs, while maintaining output accuracy.
Prompt Cache can be used as a building block for future LLM serving systems, enabling further optimizations like cache replacement strategies and host-to-device memory overhead reduction.
Statistics
Prompt Cache reduces TTFT latency by up to 8x on GPUs and 60x on CPUs compared to the baseline.
The memory overhead of caching attention states scales linearly with the number of tokens, ranging from 0.03 MB/token for BERT to 4.53 MB/token for Falcon 180B.
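These per-token figures are consistent with a back-of-the-envelope formula, assuming fp16 storage and full multi-head key/value tensors at every layer: bytes/token = 2 (K and V) × n_layers × hidden_size × 2 bytes. The model dimensions below are the commonly published configurations, not values taken from the paper:

```python
# Back-of-the-envelope per-token KV-cache size, assuming fp16 (2 bytes)
# and a full key and value vector stored at every layer. Model dimensions
# are the commonly published configs, used here as assumptions.
def mb_per_token(n_layers: int, hidden: int, bytes_per_elem: int = 2) -> float:
    kv = 2                       # one key and one value vector per layer
    return kv * n_layers * hidden * bytes_per_elem / 2**20

print(f"BERT-base:   {mb_per_token(12, 768):.3f} MB/token")    # ~0.035
print(f"Falcon 180B: {mb_per_token(80, 14848):.2f} MB/token")  # ~4.53
```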
Quotes
"Prompt Cache is motivated by the observation that input prompts served by LLM servers often share components in a highly structured manner."
"The key idea is to precompute attention states of the frequently revisited prompt segments in memory for reuse."