The paper addresses the challenges of deploying large language models (LLMs) in streaming applications, where long interactions are expected. Two key issues are identified: during decoding, caching the key and value (KV) states of all previous tokens consumes extensive memory, and popular LLMs cannot generalize to texts longer than their training sequence length.
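To get a feel for the memory issue, the back-of-the-envelope calculation below estimates the KV-cache size for a Llama-2-7B-scale model in fp16; the configuration values are illustrative assumptions, not figures from the paper.

```python
# Rough KV-cache footprint for a Llama-2-7B-like model in fp16.
# Config values below are illustrative assumptions, not figures from the paper.
n_layers, n_heads, head_dim, bytes_per_elem = 32, 32, 128, 2

def kv_cache_bytes(seq_len: int) -> int:
    # Two cached tensors (K and V) per layer, each of shape [n_heads, seq_len, head_dim].
    return 2 * n_layers * n_heads * head_dim * bytes_per_elem * seq_len

print(kv_cache_bytes(4096) / 2**30)       # ~2 GiB at a 4K context
print(kv_cache_bytes(4_000_000) / 2**40)  # ~1.9 TiB after 4 million tokens
```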
The authors observe an "attention sink" phenomenon: LLMs allocate a surprisingly large amount of attention to the initial tokens, regardless of their semantic relevance. This is attributed to the Softmax operation, which requires attention scores to sum to one, so even when the current query has no strong match among earlier tokens, the leftover attention mass must go somewhere; the initial tokens, visible to every subsequent position, become its default destination.
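A minimal numerical illustration of why the Softmax normalization matters: attention weights always sum to one, so surplus probability mass cannot simply vanish. The score values below are made up purely for illustration.

```python
import numpy as np

# Attention weights are a softmax over query-key scores, so they always sum to 1
# (up to float error); surplus mass must be assigned to some tokens.
scores = np.array([0.3, 0.1, 0.2, 0.1, 0.15])    # uniformly weak matches (made-up values)
weights = np.exp(scores) / np.exp(scores).sum()
print(weights, weights.sum())                    # sum prints ~1.0
```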
Based on this insight, the authors propose StreamingLLM, a framework that retains the KV states of the attention sink tokens alongside the sliding window's KV to anchor the attention computation and stabilize the model's behavior. StreamingLLM enables models such as Llama-2, MPT, Falcon, and Pythia to perform stable language modeling over up to 4 million tokens and potentially more, without any fine-tuning.
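The cache policy itself is simple to sketch. The snippet below is a minimal illustration of the idea, not the authors' implementation: keep the KV entries of the first few sink tokens plus a rolling window of the most recent tokens, and evict everything in between (in StreamingLLM, positions are also re-assigned relative to the cache rather than the original text). The window size here is an illustrative choice.

```python
# Minimal sketch of the StreamingLLM eviction policy (not the authors' implementation):
# retain the first `n_sink` cache entries plus the latest `window` entries.
def indices_to_keep(cache_len: int, n_sink: int = 4, window: int = 1020) -> list[int]:
    budget = n_sink + window
    if cache_len <= budget:
        return list(range(cache_len))
    return list(range(n_sink)) + list(range(cache_len - window, cache_len))

# Example: after 5,000 decoded tokens, only 1,024 KV entries survive.
kept = indices_to_keep(5000)
print(len(kept), kept[:6], kept[-2:])  # 1024 [0, 1, 2, 3, 3980, 3981] [4998, 4999]
```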
The authors further demonstrate that pre-training LLMs with a dedicated attention sink token can improve streaming performance, eliminating the need for multiple initial tokens as sinks.
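As a rough sketch of that pre-training recipe (the token id and helper below are hypothetical, not the paper's code): a single learnable placeholder token is prepended to every training sample, giving the model a fixed place to park excess attention, so only that one token's KV needs to be retained as the sink at inference time.

```python
# Hypothetical illustration of the recipe: prepend one dedicated, learnable
# sink token to every pre-training sample. The id below is a placeholder,
# not the paper's actual vocabulary choice.
SINK_TOKEN_ID = 0

def prepend_sink(token_ids: list[int]) -> list[int]:
    """Prepend the dedicated sink token to a pre-training sample."""
    return [SINK_TOKEN_ID] + token_ids

print(prepend_sink([17, 42, 99]))  # [0, 17, 42, 99]
```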
Experiments show that StreamingLLM achieves up to a 22.2× speedup over the sliding-window-with-re-computation baseline while maintaining a similar memory footprint; the baseline must rebuild the KV states of the entire window for every newly generated token, whereas StreamingLLM reuses its cached KV and only computes the new token's states. The framework is evaluated on long-text language modeling, streaming question answering, and other benchmarks, demonstrating both its efficiency and effectiveness.
Key insights distilled from Guangxuan Xi... at arxiv.org (04-09-2024): https://arxiv.org/pdf/2309.17453.pdf