ATP: Enabling Fast LLM Serving via Attention on Top Principal Keys

Core Concepts
ATP, a low-rank self-attention mechanism, reduces attention complexity for transformers and LLMs by leveraging low-rank structures in input sequences.
ATP introduces a new attention mechanism that focuses on top principal keys rather than individual tokens. By analyzing the low-rank structure in input sequences, ATP reduces attention complexity from quadratic to linear. Evaluations on BERT and Llama models show comparable accuracy with reduced computation and memory complexity. The method effectively captures semantic relationships with fewer principal keys/values.
We propose a new attention mechanism with linear complexity, ATP. ATP transforms inputs into an orthogonal space and computes attention only on the top principal bases (keys). Attention complexity is reduced from quadratic to linear without a noticeable performance drop: ATP barely loses accuracy with only 1/2 of the principal keys, and incurs around a 2% accuracy drop with 1/4 of the principal keys.
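The core idea — projecting the sequence onto its top principal directions and attending over those instead of all tokens — can be sketched as follows. This is a minimal illustration of the technique described above, not the authors' exact algorithm; the function name, the use of SVD as the orthogonal transform, and the choice of rank `r` are assumptions for illustration.

```python
import numpy as np

def atp_attention_sketch(Q, K, V, r):
    """Illustrative low-rank attention: attend over r principal keys
    instead of n token keys, so the score matrix is (n, r) rather than
    (n, n), reducing cost from O(n^2 d) to O(n r d). Not the paper's
    exact method; a sketch of the general idea."""
    n, d = K.shape
    # Orthogonal transform of the sequence dimension via SVD of K.
    # The top-r left singular vectors span the principal subspace.
    U, S, Vt = np.linalg.svd(K, full_matrices=False)
    Ur = U[:, :r]                # (n, r) top principal bases
    K_p = Ur.T @ K               # (r, d) principal keys
    V_p = Ur.T @ V               # (r, d) principal values
    # Softmax attention over r principal keys only.
    scores = Q @ K_p.T / np.sqrt(d)                       # (n, r)
    weights = np.exp(scores - scores.max(axis=1, keepdims=True))
    weights /= weights.sum(axis=1, keepdims=True)
    return weights @ V_p                                  # (n, d)
```

When the input sequence is approximately low-rank, a small `r` preserves most of the key/value information while shrinking the attention computation linearly in sequence length.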
"We propose a new attention mechanism with linear complexity, ATP."
"Owing to the observed low-rank structure in input sequences, ATP is able to capture semantic relationships."
"Our evaluations demonstrate that ATP achieves comparable accuracy with much lower computation and memory complexity."

Key Insights Distilled From

by Yue Niu, Saur... at 03-06-2024

Deeper Inquiries

How can the concept of low-rank structures be applied to other machine learning models beyond transformers?


What are the potential risks associated with rapidly deploying adverse LLM services using ATP?


How can the findings of this study impact the development of more efficient language models in the future?