ATP: Enabling Fast LLM Serving via Attention on Top Principal Keys
Key Concepts
ATP is a low-rank self-attention mechanism that reduces attention complexity for transformers and LLMs by exploiting the low-rank structure of input sequences.
Summary
ATP introduces a new attention mechanism that focuses on top principal keys rather than individual tokens. By analyzing the low-rank structure in input sequences, ATP reduces attention complexity from quadratic to linear. Evaluations on BERT and Llama models show comparable accuracy with reduced computation and memory complexity. The method effectively captures semantic relationships with fewer principal keys/values.
Statistics
We propose a new attention mechanism with linear complexity, ATP.
ATP transforms inputs into an orthogonal space and computes attention only on the top principal bases (keys).
The attention complexity is reduced from quadratic to linear without noticeable performance drop.
ATP loses almost no accuracy when keeping only 1/2 of the principal keys, and incurs around a 2% accuracy drop when keeping 1/4 of the principal keys.
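The mechanism described above — transforming inputs into an orthogonal space and attending only over the top principal bases — can be sketched as follows. This is a minimal illustration, not the paper's implementation: the function name `atp_attention`, the use of an SVD of the key matrix to obtain the orthogonal basis, and the choice to project both keys and values through the same top-r left-singular vectors are all assumptions made for the sketch.

```python
import numpy as np

def atp_attention(Q, K, V, r):
    """Sketch of attention over top principal keys (assumed mechanics).

    Instead of attending over all n token keys, project K and V onto the
    top-r principal bases of the key matrix and attend over those r
    principal keys/values, reducing the attention cost from O(n^2 d)
    to O(n r d).
    """
    n, d = K.shape
    # Orthogonal decomposition of the key matrix; the observed low-rank
    # structure of input sequences means the top components dominate.
    U, S, Vt = np.linalg.svd(K, full_matrices=False)   # K = U @ diag(S) @ Vt
    # Project token keys/values into the subspace spanned by the top-r
    # left-singular vectors: r "principal" keys/values of dimension d.
    K_r = U[:, :r].T @ K                               # (r, d) principal keys
    V_r = U[:, :r].T @ V                               # (r, d) principal values
    # Standard scaled-dot-product attention, but over r bases, not n tokens.
    scores = Q @ K_r.T / np.sqrt(d)                    # (n, r)
    weights = np.exp(scores - scores.max(axis=1, keepdims=True))
    weights /= weights.sum(axis=1, keepdims=True)      # row-wise softmax
    return weights @ V_r                               # (n, d) outputs
```

Keeping r = n/2 or n/4 principal keys corresponds to the accuracy/cost trade-offs reported above: the score matrix shrinks from (n, n) to (n, r), which is where the quadratic-to-linear reduction comes from.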
Quotes
"We propose a new attention mechanism with linear complexity, ATP."
"Owing to the observed low-rank structure in input sequences, ATP is able to capture semantic relationships."
"Our evaluations demonstrate that ATP achieves comparable accuracy with much lower computation and memory complexity."