ExeGPT, a distributed system, finds and runs an optimal execution schedule that maximizes inference throughput while satisfying a given latency constraint; it leverages the distribution of input and output sequence lengths to allocate resources and select execution configurations effectively.
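To illustrate the schedule-search step, the sketch below exhaustively scores candidate schedules against a latency SLO. The `Schedule` fields, the `estimate_latency`/`estimate_throughput` cost model, and all constants are hypothetical stand-ins for ExeGPT's actual scheduler and simulator, chosen only to make the search loop concrete.

```python
from dataclasses import dataclass
from itertools import product

@dataclass(frozen=True)
class Schedule:
    batch_size: int     # decoding batch size (hypothetical control variable)
    micro_batches: int  # pipeline micro-batch count (hypothetical control variable)

def estimate_latency(s: Schedule, mean_out_len: float) -> float:
    # Hypothetical analytic cost model (not ExeGPT's): per-token time grows
    # with batch size, and pipeline bubbles shrink with more micro-batches.
    per_token = (0.001 + 0.0001 * s.batch_size) * (1 + 1.0 / s.micro_batches)
    return per_token * mean_out_len

def estimate_throughput(s: Schedule, mean_out_len: float) -> float:
    # Tokens generated per second across the whole batch.
    return s.batch_size * mean_out_len / estimate_latency(s, mean_out_len)

def find_schedule(latency_slo: float, mean_out_len: float) -> Schedule:
    """Exhaustive search: keep the highest-throughput schedule whose
    estimated latency satisfies the SLO."""
    best, best_tp = None, -1.0
    for bs, mb in product([1, 2, 4, 8, 16, 32, 64], [1, 2, 4, 8]):
        s = Schedule(bs, mb)
        if estimate_latency(s, mean_out_len) <= latency_slo:
            tp = estimate_throughput(s, mean_out_len)
            if tp > best_tp:
                best, best_tp = s, tp
    return best

print(find_schedule(latency_slo=1.0, mean_out_len=128.0))
```

Because the output-length distribution shifts the latency estimate, feeding the search a measured mean (rather than a worst case) is what lets such a system trade slack against throughput.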
By identifying the importance of individual attention layers, SQUEEZEATTENTION optimizes the KV-cache jointly along both the sequence and layer dimensions, achieving significant memory and throughput improvements for LLM inference.
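As an illustration of layer-wise cache budgeting, the sketch below redistributes a fixed token budget across layers in proportion to per-layer importance scores; the scores (e.g., derived from hidden-state similarity), the `floor_frac` safeguard, and the proportional rule are illustrative assumptions, not SQUEEZEATTENTION's exact measurement or policy.

```python
import numpy as np

def allocate_kv_budget(importance: np.ndarray, total_budget: int,
                       floor_frac: float = 0.2) -> np.ndarray:
    """Split `total_budget` cached tokens across layers: every layer keeps a
    small floor, and the remainder is distributed proportionally to
    importance, so less important layers free memory for important ones."""
    n = len(importance)
    floor = int(floor_frac * total_budget / n)   # guaranteed minimum per layer
    remaining = total_budget - floor * n
    weights = importance / importance.sum()
    extra = np.floor(weights * remaining).astype(int)
    return floor + extra

importance = np.array([0.9, 0.8, 0.3, 0.2, 0.7, 0.1])  # toy per-layer scores
print(allocate_kv_budget(importance, total_budget=6000))
```

The per-layer budgets then bound how many KV entries each layer's eviction policy retains, which is where the sequence-dimension and layer-dimension decisions combine.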
DEFT, an IO-aware tree attention algorithm, reduces memory-access redundancy in tree-based decoding by leveraging the tree topology to minimize KV-cache IO and eliminate the IO of partial results during attention computation.
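The memory-reuse idea behind tree attention can be shown with flash-attention-style partial states: the shared prefix KV is swept once for all branch queries, and each branch then merges in only its private suffix. This sketch captures the KV-sharing arithmetic only; DEFT's IO-aware kernel and scheduling are not reproduced here.

```python
import numpy as np

def partial_attn(Q, K, V):
    """Partial state for one KV segment: per-query running max, exp-sum, and
    unnormalized output, mergeable exactly across segments."""
    S = Q @ K.T / np.sqrt(Q.shape[-1])      # (n_q, n_kv) scores
    m = S.max(axis=1, keepdims=True)
    P = np.exp(S - m)
    return m, P.sum(axis=1, keepdims=True), P @ V

def merge(a, b):
    """Combine two segments' partial states into the exact softmax attention
    output over their concatenation."""
    (ma, la, oa), (mb, lb, ob) = a, b
    m = np.maximum(ma, mb)
    la, lb = la * np.exp(ma - m), lb * np.exp(mb - m)
    oa, ob = oa * np.exp(ma - m), ob * np.exp(mb - m)
    return (oa + ob) / (la + lb)

rng = np.random.default_rng(0)
d, n_branches = 16, 3
K_pre, V_pre = rng.normal(size=(100, d)), rng.normal(size=(100, d))  # shared prefix KV
Q = rng.normal(size=(n_branches, d))                                 # one query per branch
sufs = [(rng.normal(size=(5, d)), rng.normal(size=(5, d))) for _ in range(n_branches)]

# Sweep the shared prefix once for every branch query, then merge suffixes.
pre = partial_attn(Q, K_pre, V_pre)
for i, (K_suf, V_suf) in enumerate(sufs):
    pre_i = (pre[0][i:i+1], pre[1][i:i+1], pre[2][i:i+1])
    out = merge(pre_i, partial_attn(Q[i:i+1], K_suf, V_suf))
    # Sanity check against naive attention over the concatenated KV.
    S = Q[i] @ np.vstack([K_pre, K_suf]).T / np.sqrt(d)
    ref = (np.exp(S - S.max()) / np.exp(S - S.max()).sum()) @ np.vstack([V_pre, V_suf])
    assert np.allclose(out[0], ref)
```

Grouping all branch queries against the one shared prefix segment is what removes the redundant prefix loads; the exact-merge identity is what lets the partial results stay on-chip instead of being written out.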
A novel approach enables the use of low-precision block floating point formats without compromising the resulting model accuracy by exploiting the channel-wise patterns commonly exhibited by outliers in weights and activations.
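A minimal sketch of block floating point quantization with channel-aligned blocks follows: each block shares one exponent, and blocking along the channel dimension keeps an outlier channel's large exponent from degrading the precision of other channels. The block size and mantissa width here are illustrative choices, not the paper's.

```python
import numpy as np

def bfp_quantize(x: np.ndarray, block: int = 16, mant_bits: int = 4) -> np.ndarray:
    """Quantize a 1-D array in blocks: each block shares the exponent of its
    largest-magnitude value; mantissas are signed `mant_bits`-bit integers."""
    pad = (-len(x)) % block
    xp = np.pad(x, (0, pad)).reshape(-1, block)
    # Shared exponent per block, sized to the block's max |value|.
    exp = np.ceil(np.log2(np.abs(xp).max(axis=1, keepdims=True) + 1e-30))
    scale = 2.0 ** (exp - (mant_bits - 1))
    # Round to the mantissa grid; the block maximum may clip slightly.
    mant = np.clip(np.round(xp / scale),
                   -(2 ** (mant_bits - 1)), 2 ** (mant_bits - 1) - 1)
    return (mant * scale).reshape(-1)[: len(x)]

# Channel-wise blocking: quantize each weight channel (row) separately so an
# outlier channel's large shared exponent stays confined to that channel.
rng = np.random.default_rng(0)
W = rng.normal(size=(8, 64))
W[3] *= 50.0                      # a hypothetical outlier channel
W_q = np.stack([bfp_quantize(row) for row in W])
print(f"mean abs quantization error: {np.abs(W - W_q).mean():.4f}")
```

Had the blocks cut across channels, the outlier channel's exponent would dominate every block it touches and flush small values in co-resident channels to zero; aligning blocks with the channel-wise outlier pattern is what preserves accuracy at low precision.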