The content describes a novel hardware architecture for the attention mechanism in large language models (LLMs) using analog in-memory computing (IMC) based on gain cell memories. The key highlights are:
The architecture eliminates the need to load the key (K) and value (V) projections from GPU memory for each inference step, which is a major bottleneck in GPU-based LLM inference. Instead, the K and V projections are stored directly in the analog gain cell arrays and the attention computations are performed entirely in the analog domain.
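A minimal NumPy sketch of the idea, for intuition only: the two matrix-vector products of a single decode step are computed against stationary K and V arrays, mimicking how the gain cell arrays keep the data in place. The shapes, the plain softmax, and the noise-free arithmetic are all simplifying assumptions; the actual hardware operates on analog values with a hardware-friendly nonlinearity.

```python
import numpy as np

d_head, n_tokens = 64, 128               # head dimension, stored window length
K = np.random.randn(n_tokens, d_head)    # keys stored in one gain cell array
V = np.random.randn(n_tokens, d_head)    # values stored in a second array

def analog_attention_step(q: np.ndarray) -> np.ndarray:
    """One decode step: both matmuls happen 'in memory', so K and V
    never move to the processor; only q and the result cross the boundary."""
    scores = K @ q / np.sqrt(d_head)          # array 1: q against all stored keys
    weights = np.exp(scores - scores.max())   # stand-in for the analog nonlinearity
    weights /= weights.sum()
    return V.T @ weights                      # array 2: weighted sum of values

out = analog_attention_step(np.random.randn(d_head))
print(out.shape)  # (64,)
```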
The architecture utilizes Sliding Window Attention, which keeps track of only the most recent tokens, to reduce the memory requirements compared to full attention. The gain cell arrays are written and read in a sequential manner to implement the sliding window.
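The sequential write pattern can be pictured as a circular buffer over the array rows (an assumed addressing scheme for illustration; the paper's exact row-rotation mechanism may differ). Each new token overwrites the oldest row, so the arrays always hold exactly the most recent `window` tokens:

```python
import numpy as np

class SlidingWindowKV:
    def __init__(self, window: int, d_head: int):
        self.K = np.zeros((window, d_head))
        self.V = np.zeros((window, d_head))
        self.ptr = 0      # next row to overwrite

    def write(self, k: np.ndarray, v: np.ndarray) -> None:
        """Overwrite the oldest row; analogous to rewriting one gain cell row."""
        self.K[self.ptr] = k
        self.V[self.ptr] = v
        self.ptr = (self.ptr + 1) % len(self.K)

kv = SlidingWindowKV(window=4, d_head=8)
for t in range(6):
    kv.write(np.full(8, float(t)), np.full(8, float(t)))
print(kv.K[:, 0])  # [4. 5. 2. 3.] -> only the 4 most recent tokens remain
```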
The authors propose an algorithm-hardware co-optimization approach, including a hardware-aware fine-tuning method that adapts pre-trained LLM weights to the constraints of the analog gain cell hardware. This allows the model to achieve performance comparable to a pre-trained GPT-2 model with minimal additional training.
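A common ingredient of such hardware-aware fine-tuning is the straight-through estimator: the forward pass applies a stand-in for the analog nonideality (coarse quantization is used below as an assumed example of the hardware constraint), while the backward pass lets gradients through unchanged so the pre-trained weights remain trainable. A minimal PyTorch sketch, not the authors' exact method:

```python
import torch

class QuantizeSTE(torch.autograd.Function):
    @staticmethod
    def forward(ctx, x, levels=16):
        # Coarse symmetric quantization as a proxy for analog constraints.
        scale = x.abs().max().clamp(min=1e-8)
        return torch.round(x / scale * (levels // 2)) / (levels // 2) * scale

    @staticmethod
    def backward(ctx, grad_output):
        return grad_output, None  # straight-through: identity gradient

w = torch.randn(8, 8, requires_grad=True)
loss = QuantizeSTE.apply(w).sum()
loss.backward()
print(torch.allclose(w.grad, torch.ones_like(w)))  # True: gradient bypassed the quantizer
```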
The end-to-end hardware design, including the digital control circuitry, is estimated to reduce attention latency by up to two orders of magnitude and energy consumption by up to five orders of magnitude relative to GPUs, enabling ultra-fast, low-power sequence generation in LLMs.