Optimizing Inference of Large Language Models on AI Accelerators
Powerful foundation models, including large language models (LLMs), are costly and slow to serve, making cost-effective, low-latency inference on AI accelerators a significant challenge. This tutorial presents a comprehensive overview of techniques for optimizing LLM inference, covering system-level optimizations, structured transformer architectures, model compression, and fast decoding strategies.