Deploying cost-effective, low-latency inference for powerful foundation models, including large language models (LLMs), on AI accelerators remains a significant challenge. This tutorial presents a comprehensive set of techniques for optimizing LLM inference, covering system-level optimizations, structured transformer architectures, model compression, and fast decoding strategies.
Dynamic-width speculative beam decoding (DSBD) combines the speed advantage of speculative decoding with the accuracy and diversity benefits of beam sampling, enabling more efficient and effective large language model inference.
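As a rough illustration of the idea, the sketch below uses hypothetical `draft_step` and `target_logprob` callables in place of a real draft model and target model: each beam drafts candidate tokens cheaply, the target model verifies and re-scores them, and the beam width for the next step adapts to how often the target model agrees with the drafts. This is a simplified single-token-per-step sketch under those assumptions, not the DSBD algorithm itself.

```python
import random

random.seed(0)
VOCAB = list(range(50))

def draft_step(prefix, k):
    """Hypothetical cheap draft model: propose k candidate next tokens with draft log-probs."""
    return [(tok, random.uniform(-2.0, 0.0)) for tok in random.sample(VOCAB, k)]

def target_logprob(prefix, token):
    """Hypothetical expensive target model scoring one continuation (deterministic toy)."""
    rng = random.Random(hash((tuple(prefix), token)))
    return rng.uniform(-3.0, 0.0)

def dsbd_decode(prompt, steps=5, max_width=4, min_width=1):
    """Sketch of dynamic-width speculative beam decoding: beams draft candidates cheaply,
    the target model re-scores them, and the beam width adapts to the draft acceptance rate."""
    beams = [(list(prompt), 0.0)]            # (tokens, cumulative target log-prob)
    width = max_width
    for _ in range(steps):
        scored, accepted = [], 0
        for tokens, score in beams:
            for tok, draft_lp in draft_step(tokens, width):
                tgt_lp = target_logprob(tokens, tok)       # verification by the target model
                if tgt_lp >= draft_lp - 1.0:               # draft roughly agrees with target
                    accepted += 1
                scored.append((tokens + [tok], score + tgt_lp))
        # Keep the best continuations under the target model's scores.
        scored.sort(key=lambda b: b[1], reverse=True)
        beams = scored[:width]
        # Adapt width: shrink when drafts are rarely accepted, grow when they usually are.
        accept_rate = accepted / max(1, len(scored))
        width = max(min_width, min(max_width, round(max_width * accept_rate)))
    return beams

if __name__ == "__main__":
    for tokens, score in dsbd_decode([1, 2, 3]):
        print(round(score, 2), tokens)
```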
Attention offloading, an approach that executes the attention operator separately from the rest of the model evaluation, can significantly improve the cost-efficiency and performance of large language model inference.
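One way to picture the split is the PyTorch sketch below, which places the attention operator on one device and the feed-forward block on another, moving activations between them (falling back to CPU when fewer GPUs are available). The device assignment and module names are illustrative assumptions, not the design of the system summarized above.

```python
import torch
import torch.nn as nn

# Hypothetical device split: attention on one device, feed-forward on another;
# fall back to CPU when fewer than two GPUs are present.
attn_device = torch.device("cuda:0" if torch.cuda.is_available() else "cpu")
ffn_device = torch.device("cuda:1" if torch.cuda.device_count() > 1 else "cpu")

class OffloadedBlock(nn.Module):
    """One transformer block with the attention operator placed on a separate
    device from the feed-forward network (attention-offloading sketch)."""
    def __init__(self, d_model=256, n_heads=4):
        super().__init__()
        self.attn = nn.MultiheadAttention(d_model, n_heads, batch_first=True).to(attn_device)
        self.ffn = nn.Sequential(
            nn.Linear(d_model, 4 * d_model),
            nn.GELU(),
            nn.Linear(4 * d_model, d_model),
        ).to(ffn_device)

    def forward(self, x):
        # Attention (and its cached state) is evaluated on the attention device...
        h = x.to(attn_device)
        attn_out, _ = self.attn(h, h, h, need_weights=False)
        # ...then activations move to the compute device for the feed-forward block.
        return self.ffn(attn_out.to(ffn_device))

if __name__ == "__main__":
    block = OffloadedBlock()
    tokens = torch.randn(2, 16, 256)   # (batch, seq_len, d_model)
    out = block(tokens)
    print(out.shape, out.device)
```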
KCache is a technique that reduces the memory footprint of large language model inference, achieving a 40% increase in throughput while maintaining accuracy.
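The NumPy sketch below illustrates one way such a cache-reduction scheme can work, assuming the key cache stays in fast accelerator memory while values live in slower host memory and only the top-scoring value vectors are fetched at each step. The function names and the top-n policy are assumptions made for illustration, not KCache's exact design.

```python
import numpy as np

def softmax(x):
    e = np.exp(x - x.max())
    return e / e.sum()

def selective_value_attention(q, k_cache, v_lookup, top_n=8):
    """Sketch of a reduced-footprint attention step: scores over the full key cache
    pick the top-n positions, and only those value vectors are fetched from slower
    storage (v_lookup is a hypothetical callable standing in for a host-memory store)."""
    scores = softmax(q @ k_cache.T / np.sqrt(q.shape[-1]))    # (seq_len,)
    top_idx = np.argsort(scores)[-top_n:]                     # highest-scoring positions
    v_selected = v_lookup(top_idx)                            # fetch only the needed values
    weights = scores[top_idx] / scores[top_idx].sum()         # renormalize over kept positions
    return weights @ v_selected

if __name__ == "__main__":
    rng = np.random.default_rng(0)
    seq_len, d = 128, 64
    k_cache = rng.standard_normal((seq_len, d))   # keys kept in fast accelerator memory
    v_store = rng.standard_normal((seq_len, d))   # values held in (slower) host memory
    q = rng.standard_normal(d)
    out = selective_value_attention(q, k_cache, lambda idx: v_store[idx])
    print(out.shape)
```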
BASS is a system for batched speculative decoding of large language models that achieves lower latency, higher GPU utilization, and better accuracy than prior approaches.
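The core difficulty in batching speculative decoding is that each sequence in the batch may accept a different number of drafted tokens per step, leaving ragged sequence lengths. The toy sketch below, with hypothetical `draft_tokens` and `target_accepts` stand-ins for the draft and target models, shows that per-sequence draft-verify-accept loop; it illustrates the general pattern rather than BASS's implementation.

```python
import random

random.seed(0)
VOCAB = 100

def draft_tokens(seq, k):
    """Hypothetical small draft model: propose k speculative tokens for one sequence."""
    return [random.randrange(VOCAB) for _ in range(k)]

def target_accepts(seq, token):
    """Hypothetical target-model check: does the large model agree with this draft token?"""
    return random.random() < 0.7

def batched_speculative_step(batch, k=4):
    """One batched speculative-decoding step sketch: draft k tokens for every sequence,
    verify them (conceptually in one batched target forward pass), and let each sequence
    accept a different-length prefix, leaving ragged sequence lengths to track."""
    new_batch, accepted_counts = [], []
    for seq in batch:
        accepted = []
        for tok in draft_tokens(seq, k):
            if target_accepts(seq + accepted, tok):
                accepted.append(tok)
            else:
                break                          # stop at the first rejected draft token
        # On rejection (or exhaustion), the target model supplies one guaranteed token.
        accepted.append(random.randrange(VOCAB))
        new_batch.append(seq + accepted)
        accepted_counts.append(len(accepted))
    return new_batch, accepted_counts

if __name__ == "__main__":
    batch = [[1], [2], [3]]
    for _ in range(3):
        batch, counts = batched_speculative_step(batch)
        print("tokens accepted this step:", counts, "lengths:", [len(s) for s in batch])
```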