The paper addresses the challenges of deploying large language models (LLMs) in streaming applications, where long interactions are expected. Two key issues are identified: caching the Key and Value (KV) states of all previous tokens consumes ever-growing memory during decoding, and popular LLMs cannot generalize to sequences longer than the length they were trained on.
The authors observe an "attention sink" phenomenon, where LLMs allocate a surprisingly large amount of attention to the initial tokens, regardless of their semantic relevance. This is attributed to the Softmax operation, which forces attention scores to sum to one: even when the current query has no strong match among earlier tokens, the surplus attention mass has to go somewhere, and it consistently pools on the initial tokens, which are visible to every subsequent position.
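As a quick illustration of the Softmax point (a toy numerical sketch, not code from the paper): softmax always hands out exactly one unit of attention across the cached tokens, so when no key strongly matches the query, that mass still has to land somewhere.

```python
import numpy as np

def softmax(x):
    # Subtract the max for numerical stability; the output still sums to 1.
    e = np.exp(x - np.max(x))
    return e / e.sum()

# Hypothetical attention logits for one query over 8 cached tokens,
# none of which is a strong semantic match (all logits small and similar).
logits = np.array([0.2, 0.1, 0.0, 0.1, 0.0, 0.1, 0.2, 0.1])

weights = softmax(logits)
print(weights.sum())  # 1.0 -- a full unit of attention is always distributed
print(weights)        # near-uniform here; in trained LLMs this surplus tends to
                      # pool on the initial tokens, which every position can see
```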
Based on this insight, the authors propose StreamingLLM, a framework that keeps the attention sink tokens' KV alongside the sliding window's KV to anchor the attention computation and stabilize the model's performance. StreamingLLM enables models like Llama-2, MPT, Falcon, and Pythia to reliably model 4 million tokens and potentially more, without any fine-tuning.
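A minimal sketch of the cache policy this implies is shown below. The names (`num_sink_tokens`, `window_size`, `StreamingKVCache`) are hypothetical; a real implementation operates on per-layer key/value tensors and also reassigns token positions within the cache rather than using positions from the original text.

```python
from collections import deque

class StreamingKVCache:
    """Sketch of a StreamingLLM-style cache: the KV entries of a few initial
    "attention sink" tokens are kept permanently, while the rest of the cache
    is a sliding window over the most recent tokens."""

    def __init__(self, num_sink_tokens=4, window_size=1024):
        self.num_sink_tokens = num_sink_tokens
        self.sink_kv = []                            # first tokens' KV, never evicted
        self.window_kv = deque(maxlen=window_size)   # recent tokens; oldest dropped automatically

    def append(self, kv_entry):
        # kv_entry stands in for the (key, value) tensors of one new token.
        if len(self.sink_kv) < self.num_sink_tokens:
            self.sink_kv.append(kv_entry)
        else:
            self.window_kv.append(kv_entry)

    def current_cache(self):
        # Each decoding step attends over the sink tokens plus the recent window,
        # so the cache size stays constant no matter how long the stream runs.
        return self.sink_kv + list(self.window_kv)
```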
The authors further demonstrate that pre-training LLMs with a dedicated attention sink token can improve streaming performance, eliminating the need for multiple initial tokens as sinks.
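A rough sketch of what such pre-training could look like, assuming a vocabulary extended by one learnable sink token; all identifiers below are illustrative, not the paper's code.

```python
import torch

# Hypothetical setup: the vocabulary is extended by one extra id that serves
# as a dedicated, learnable attention sink.
VOCAB_SIZE = 32000
SINK_TOKEN_ID = VOCAB_SIZE                              # new id appended to the vocabulary
embedding = torch.nn.Embedding(VOCAB_SIZE + 1, 4096)    # embedding table grows by one row

def prepend_sink(token_ids: torch.Tensor) -> torch.Tensor:
    """Prepend the sink token to every training sequence so the model learns to
    park excess attention on it instead of on arbitrary initial tokens."""
    batch_size = token_ids.shape[0]
    sink_column = torch.full((batch_size, 1), SINK_TOKEN_ID, dtype=token_ids.dtype)
    return torch.cat([sink_column, token_ids], dim=1)

# At streaming inference time, only this single sink token's KV (plus the
# sliding window) would then need to be retained.
```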
Experiments show that StreamingLLM achieves up to a 22.2× speedup over the sliding-window-with-recomputation baseline while maintaining a similar memory footprint. The framework is evaluated on long-text language modeling, streaming question answering, and other benchmarks, showcasing its efficiency and effectiveness.
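The gap is intuitive from the per-token work each method does. The arithmetic below is a deliberately simplified toy model (attention cost taken as proportional to the number of query-key pairs), not the paper's measurement.

```python
# Toy per-token cost comparison, with L the window size.
L = 1024

# Sliding window with re-computation: the KV states of all L window tokens are
# rebuilt from scratch at every decoding step -> roughly O(L^2) pairwise work.
recompute_cost_per_token = L * L

# StreamingLLM: the cached sink + window KV is reused; only the new token
# attends to the existing cache -> roughly O(L) work per step.
streaming_cost_per_token = L

# ~1000x in this toy model; measured end-to-end speedups are smaller (up to
# 22.2x) because other costs besides attention dominate in practice.
print(recompute_cost_per_token / streaming_cost_per_token)
```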
Source: Guangxuan Xiao et al., https://arxiv.org/pdf/2309.17453.pdf (arXiv, 04-09-2024)