Key Concepts
Keyformer introduces a novel approach to reducing KV cache size, improving inference latency and token generation throughput without compromising accuracy.
Summary
Transformers are central to Large Language Models (LLMs) but struggle with long-context processing, because the KV cache grows with sequence length and inflates both memory use and latency. Keyformer shrinks the KV cache by identifying and retaining only key tokens, improving performance across a range of models and tasks. The Gumbel distribution proves effective as a regularizer for key-token identification.
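The Gumbel-based regularization can be made concrete with a short sketch. The Python snippet below illustrates the general idea of perturbing per-token attention scores with Gumbel noise before a top-k cut, so that the retained set is a softened rather than purely greedy selection. It is a minimal sketch under stated assumptions: the function `select_key_tokens`, its accumulated-score input, and the `temperature` parameter are illustrative choices, not Keyformer's published interface or exact scoring function.

```python
import torch

def select_key_tokens(scores: torch.Tensor, keep: int,
                      temperature: float = 1.0) -> torch.Tensor:
    """Choose `keep` cache positions to retain, given per-token scores.

    `scores` stands in for accumulated (unnormalized) attention mass per
    cached token. Gumbel noise regularizes the selection so it does not
    overcommit to tokens that happened to score highly early in decoding.
    """
    u = torch.rand_like(scores)                       # u ~ Uniform(0, 1)
    gumbel = -torch.log(-torch.log(u + 1e-9) + 1e-9)  # standard Gumbel sample
    perturbed = scores / temperature + gumbel         # noise-perturbed scores
    return torch.topk(perturbed, keep).indices        # keep the noisy top-k

# Usage: retain half of a 2048-token KV cache
scores = torch.rand(2048)  # placeholder scores, not real attention data
kept_positions = select_key_tokens(scores, keep=1024)
```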
Statistics
Keyformer reduces inference latency by 2.1× and improves token generation throughput by 2.4×.
Approximately 90% of the attention weight concentrates on a small subset of key tokens.
Keyformer maintains accuracy while reducing KV cache size by up to 50%.
Quotes
"Keyformer leverages the observation that approximately 90% of the attention weight in generative inference focuses on a specific subset of tokens, referred to as 'key' tokens."
"Even with a 50% reduction in KV cache, Keyformer maintains the desired accuracy threshold."