Core Concept
Attention offloading, a novel approach that separates the processing of the attention operator from the overall model evaluation, can significantly enhance the cost-efficiency and performance of large language model inference.
Summary
The paper presents an innovative concept called "attention offloading" to address the challenges of serving transformer-based large language models (LLMs). LLMs deliver impressive performance on generative tasks, but serving them in practice is costly because they make inefficient use of expensive, computation-optimized accelerators.
The key insights are:
- The attention operator is memory-intensive, exhibiting a memory access pattern that clashes with the strengths of modern accelerators, especially as context length increases.
- By leveraging a collection of cheap, memory-optimized devices for the attention operator while still utilizing high-end accelerators for the other parts of the model, the proposed heterogeneous setup matches each component to the workload it handles best, maximizing overall performance and cost-efficiency (see the sketch after this list).
- The communication bandwidth required between heterogeneous devices is manageable with prevalent networking technologies, and various techniques are employed to reduce the additional latency introduced by attention offloading.
- The authors develop Lamina, a distributed heterogeneous LLM inference system that incorporates attention offloading. Experimental results indicate that Lamina can provide 1.48×–12.1× higher estimated throughput per dollar than homogeneous solutions.
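To make the division of labor concrete, here is a minimal sketch of one decoding step under attention offloading. All names and shapes (project/attention/FFN helpers, the toy hidden size, the in-process `kv_cache`) are illustrative assumptions rather than Lamina's actual API: the compute-optimized accelerator handles the dense, batch-friendly projections and feed-forward layers, while a memory-optimized device holds the KV cache and runs the memory-bound attention operator. In the real system the q/k/v tensors and the attention output would cross the interconnect between the two device classes.

```python
import numpy as np

# Illustrative sketch only -- names, shapes, and sizes are assumptions, not Lamina's API.
HIDDEN = 512          # toy hidden size
rng = np.random.default_rng(0)

# Dense weights live on the compute-optimized accelerator.
W_qkv = rng.standard_normal((HIDDEN, 3 * HIDDEN)).astype(np.float32)
W_ffn = rng.standard_normal((HIDDEN, HIDDEN)).astype(np.float32)

# The KV cache lives on a cheap, memory-optimized device.
kv_cache = {"k": [], "v": []}

def compute_device_project(x):
    """Compute device: batch-friendly GEMM producing q, k, v for the new token."""
    q, k, v = np.split(x @ W_qkv, 3, axis=-1)
    return q, k, v

def memory_device_attention(q, k, v):
    """Memory device: append to the KV cache and run the memory-bound attention."""
    kv_cache["k"].append(k)
    kv_cache["v"].append(v)
    K = np.concatenate(kv_cache["k"], axis=0)   # (context_len, HIDDEN)
    V = np.concatenate(kv_cache["v"], axis=0)
    scores = q @ K.T / np.sqrt(HIDDEN)
    weights = np.exp(scores - scores.max(axis=-1, keepdims=True))
    weights /= weights.sum(axis=-1, keepdims=True)
    return weights @ V                           # attention output sent back

def compute_device_rest(attn_out):
    """Compute device: remaining dense layers (FFN, output projection, ...)."""
    return np.maximum(attn_out @ W_ffn, 0.0)

def decode_one_token(x):
    q, k, v = compute_device_project(x)          # runs on the accelerator
    attn = memory_device_attention(q, k, v)      # q/k/v shipped over the interconnect
    return compute_device_rest(attn)             # result shipped back

hidden_state = rng.standard_normal((1, HIDDEN)).astype(np.float32)
for _ in range(4):                               # generate a few tokens
    hidden_state = decode_one_token(hidden_state)
```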
Statistics
The interconnect bandwidth required for attention offloading does not exceed 20 GB/s, even for large models with batch sizes as high as 1024.
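As a rough sanity check on this figure, a back-of-the-envelope estimate of the per-step traffic is shown below. Every parameter in it (hidden size, layer count, step latency, FP16 activations, and the assumption that q, k, v are sent out and one attention output is returned per token and layer) is an illustrative assumption, not a number taken from the paper.

```python
# Illustrative back-of-the-envelope estimate -- all parameters are assumptions.
hidden_size   = 8192        # hypothetical model width
num_layers    = 80          # hypothetical layer count
batch_size    = 1024        # matches the batch size quoted above
bytes_per_val = 2           # FP16 activations
step_latency  = 0.3         # assumed seconds per decoding step

# Per token, per layer: ship q, k, v to the memory device (3 * hidden)
# and receive the attention output back (1 * hidden).
bytes_per_token_layer = 4 * hidden_size * bytes_per_val

bytes_per_step = bytes_per_token_layer * num_layers * batch_size
bandwidth_gb_s = bytes_per_step / step_latency / 1e9

print(f"{bytes_per_step / 1e9:.2f} GB per step, ~{bandwidth_gb_s:.1f} GB/s")
# With these assumed numbers the requirement lands in the tens of GB/s,
# consistent with the <= 20 GB/s bound reported above.
```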
Quotes
"Attention offloading may introduce additional latency due to the added overhead of scheduling and networking. To mitigate this, we have employed various techniques, such as GPUDirect RDMA and device-side busy polling, which have proven effective in reducing data transfer times."
"With attention offloading, the inference process with a single batch results in underutilization of resources, as the memory device remains idle when the computation device is active, and vice versa. To address this inefficiency and enhance cost-effectiveness, we introduce staggered pipelining, an advanced technique that maximizes resource utilization without compromising inference latency."