Core Concepts
DEFT, an IO-aware tree attention algorithm, reduces memory access redundancy in tree-based decoding by leveraging the tree topology to minimize KV cache IO and eliminate IO of partial results during attention calculations.
Abstract
The paper proposes DEFT, an IO-aware tree attention algorithm, to accelerate large language model (LLM) inference with tree search.
The key insights are:
The IO workload for queries (Q) is negligible compared to that of the KV cache, as the maximum query length typically corresponds to root-to-leaf paths in the tree, resulting in relatively short queries compared to the KV cache length.
In tree-based decoding, multiple queries can share their common ancestor's KV cache during attention calculation, benefiting not only in terms of KV cache storage but also in reducing IOs.
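The second insight can be made concrete by counting KV cache token loads. The sketch below is an illustration only: the toy tree, its token counts, and the helper names are assumptions and do not come from the paper. It compares loading the KV cache once per root-to-leaf sequence with loading each node's KV cache once and sharing it among all queries below that node.

```python
# Hypothetical toy decoding tree: node -> (parent, number of KV tokens stored at the node).
TREE = {
    "root": (None, 100),
    "a": ("root", 20),
    "b": ("root", 30),
    "a1": ("a", 5),
    "a2": ("a", 5),
    "b1": ("b", 5),
}

def leaves(tree):
    """Nodes that are nobody's parent; each leaf stands for one active query."""
    parents = {parent for parent, _ in tree.values()}
    return [node for node in tree if node not in parents]

def path_to_root(tree, node):
    """List the nodes from `node` up to the root."""
    path = []
    while node is not None:
        path.append(node)
        node = tree[node][0]
    return path

def kv_loads_per_sequence(tree):
    """Every leaf query loads the KV cache of its whole root-to-leaf path."""
    return sum(tree[n][1] for leaf in leaves(tree) for n in path_to_root(tree, leaf))

def kv_loads_per_node(tree):
    """Each node's KV cache is loaded once and shared by all queries below it."""
    return sum(tokens for _, tokens in tree.values())

print(kv_loads_per_sequence(TREE), kv_loads_per_node(TREE))  # 385 165
```

In this toy tree the shared prefix at the root is loaded three times in the per-sequence scheme and once in the per-node scheme, which accounts for most of the difference.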
DEFT consists of two phases:
QKV Preparation Phase:
DEFT splits the decoding tree by nodes and groups the KV cache of each node with all queries that share it, to minimize the IO of KV cache with negligible IO overhead of queries.
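The grouping step can be sketched as follows. This is a minimal illustration, not DEFT's actual data layout: it assumes the tree is given as a hypothetical child-to-parent map and that the active queries sit at the leaves. For each node, it collects the queries whose root-to-leaf path passes through that node, so the node's KV cache can be paired with all of them at once.

```python
from collections import defaultdict

# Hypothetical decoding tree as child -> parent; leaves hold the current queries.
PARENT = {"root": None, "a": "root", "b": "root", "a1": "a", "a2": "a", "b1": "b"}

def group_queries_by_node(parent):
    """Map every node to the leaf queries whose root-to-leaf path passes through it."""
    internal = set(parent.values())
    groups = defaultdict(list)
    for leaf in (node for node in parent if node not in internal):
        node = leaf
        while node is not None:
            groups[node].append(leaf)  # this node's KV cache is needed by `leaf`
            node = parent[node]
    return dict(groups)

groups = group_queries_by_node(PARENT)
for node, queries in groups.items():
    print(node, queries)
```

Each entry of `groups` corresponds to one QKV group: the KV cache of the node plus every query that attends to it.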
Attention Calculation Phase:
DEFT uses a fused kernel to compute the partial attention, together with its LogSumExp, for each QKV group prepared in phase 1.
DEFT conducts a tree-topology-aware global reduction to get the final attention.
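The arithmetic behind the two steps of this phase can be sketched in plain Python. This is not the fused GPU kernel, only the underlying math under simplifying assumptions: a single query, no 1/sqrt(d) scaling, and hypothetical function names and toy data. Each KV segment yields a partial attention output and a LogSumExp, and the reduction reweights the partial outputs by their LogSumExp values to recover attention over the concatenated segments.

```python
import math

def partial_attention(q, keys, values):
    """Attention of one query over one KV segment, plus the segment's LogSumExp."""
    scores = [sum(qi * ki for qi, ki in zip(q, k)) for k in keys]
    m = max(scores)  # subtract the max for numerical stability
    weights = [math.exp(s - m) for s in scores]
    total = sum(weights)
    dim = len(values[0])
    out = [sum(w * v[d] for w, v in zip(weights, values)) / total for d in range(dim)]
    return out, m + math.log(total)

def merge(partials):
    """Combine per-segment (output, lse) pairs into attention over all segments."""
    lses = [lse for _, lse in partials]
    m = max(lses)
    scales = [math.exp(lse - m) for lse in lses]
    total = sum(scales)
    dim = len(partials[0][0])
    return [sum(s * out[d] for s, (out, _) in zip(scales, partials)) / total
            for d in range(dim)]

# Toy data: one query whose path covers a shared "root" segment and a "leaf" segment.
q = [0.5, -1.0]
root_k = [[1.0, 0.0], [0.0, 1.0], [1.0, 1.0]]
root_v = [[1.0, 2.0], [3.0, 4.0], [5.0, 6.0]]
leaf_k = [[-1.0, 0.5]]
leaf_v = [[7.0, 8.0]]

merged = merge([partial_attention(q, root_k, root_v),
                partial_attention(q, leaf_k, leaf_v)])
full, _ = partial_attention(q, root_k + leaf_k, root_v + leaf_v)
assert all(math.isclose(a, b, rel_tol=1e-9) for a, b in zip(merged, full))
```

In a tree, each query would merge the partial results of exactly the nodes on its own root-to-leaf path, which is where the tree topology enters the reduction.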
DEFT achieves a 1.7-2.4x speedup on two practical reasoning tasks over state-of-the-art attention algorithms, thanks to a 3.6-4.5x reduction in KV cache IO and a 25% reduction in IO for QK⊤ and Softmax.
Stats
Chain-of-Thoughts (CoT) generates only 525 tokens in total, whereas Tree-of-Thoughts (ToT) generates 24,026, which makes ToT inefficient in both end-to-end latency (seconds) and IO (TB).
The IO mainly consists of three parts: (i) KV cache: IO-KV; (ii) QK⊤: IO-QKT; (iii) Softmax(QK⊤): IO-Softmax.