Deploying cost-effective, low-latency inference for powerful foundation models, including large language models (LLMs), on AI accelerators remains a significant challenge. This tutorial presents a comprehensive set of techniques for optimizing LLM inference, covering system-level optimizations, structured transformer architectures, model compression, and fast decoding strategies.
Dynamic-width speculative beam decoding (DSBD) combines the speed advantage of speculative decoding with the accuracy and diversity benefits of beam sampling, enabling more efficient and effective large language model inference.
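As a rough illustration of the idea, the sketch below uses hypothetical `draft_step` and `target_logprob` callables in place of a real draft model and target model: each beam drafts candidate tokens cheaply, the target model verifies and re-scores them, and the beam width for the next step adapts to how often the target model agrees with the drafts. This is a simplified single-token-per-step sketch under those assumptions, not the DSBD algorithm itself.

```python
import random

random.seed(0)
VOCAB = list(range(50))

def draft_step(prefix, k):
    """Hypothetical cheap draft model: propose k candidate next tokens with draft log-probs."""
    return [(tok, random.uniform(-2.0, 0.0)) for tok in random.sample(VOCAB, k)]

def target_logprob(prefix, token):
    """Hypothetical expensive target model scoring one continuation (deterministic toy)."""
    rng = random.Random(hash((tuple(prefix), token)))
    return rng.uniform(-3.0, 0.0)

def dsbd_decode(prompt, steps=5, max_width=4, min_width=1):
    """Sketch of dynamic-width speculative beam decoding: beams draft candidates cheaply,
    the target model re-scores them, and the beam width adapts to the draft acceptance rate."""
    beams = [(list(prompt), 0.0)]            # (tokens, cumulative target log-prob)
    width = max_width
    for _ in range(steps):
        scored, accepted = [], 0
        for tokens, score in beams:
            for tok, draft_lp in draft_step(tokens, width):
                tgt_lp = target_logprob(tokens, tok)       # verification by the target model
                if tgt_lp >= draft_lp - 1.0:               # draft roughly agrees with target
                    accepted += 1
                scored.append((tokens + [tok], score + tgt_lp))
        # Keep the best continuations under the target model's scores.
        scored.sort(key=lambda b: b[1], reverse=True)
        beams = scored[:width]
        # Adapt width: shrink when drafts are rarely accepted, grow when they usually are.
        accept_rate = accepted / max(1, len(scored))
        width = max(min_width, min(max_width, round(max_width * accept_rate)))
    return beams

if __name__ == "__main__":
    for tokens, score in dsbd_decode([1, 2, 3]):
        print(round(score, 2), tokens)
```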
Attention offloading, an approach that executes the attention operator separately from the rest of the model evaluation, can significantly improve the cost-efficiency and performance of large language model inference.
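One way to picture the split is the PyTorch sketch below, which places the attention operator on one device and the feed-forward block on another, moving activations between them (falling back to CPU when fewer GPUs are available). The device assignment and module names are illustrative assumptions, not the design of the system summarized above.

```python
import torch
import torch.nn as nn

# Hypothetical device split: attention on one device, feed-forward on another;
# fall back to CPU when fewer than two GPUs are present.
attn_device = torch.device("cuda:0" if torch.cuda.is_available() else "cpu")
ffn_device = torch.device("cuda:1" if torch.cuda.device_count() > 1 else "cpu")

class OffloadedBlock(nn.Module):
    """One transformer block with the attention operator placed on a separate
    device from the feed-forward network (attention-offloading sketch)."""
    def __init__(self, d_model=256, n_heads=4):
        super().__init__()
        self.attn = nn.MultiheadAttention(d_model, n_heads, batch_first=True).to(attn_device)
        self.ffn = nn.Sequential(
            nn.Linear(d_model, 4 * d_model),
            nn.GELU(),
            nn.Linear(4 * d_model, d_model),
        ).to(ffn_device)

    def forward(self, x):
        # Attention (and its cached state) is evaluated on the attention device...
        h = x.to(attn_device)
        attn_out, _ = self.attn(h, h, h, need_weights=False)
        # ...then activations move to the compute device for the feed-forward block.
        return self.ffn(attn_out.to(ffn_device))

if __name__ == "__main__":
    block = OffloadedBlock()
    tokens = torch.randn(2, 16, 256)   # (batch, seq_len, d_model)
    out = block(tokens)
    print(out.shape, out.device)
```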
KCache is a technique that reduces the memory footprint of large language model inference, achieving a 40% increase in throughput while maintaining accuracy.
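The NumPy sketch below illustrates one way such a cache-reduction scheme can work, assuming the key cache stays in fast accelerator memory while values live in slower host memory and only the top-scoring value vectors are fetched at each step. The function names and the top-n policy are assumptions made for illustration, not KCache's exact design.

```python
import numpy as np

def softmax(x):
    e = np.exp(x - x.max())
    return e / e.sum()

def selective_value_attention(q, k_cache, v_lookup, top_n=8):
    """Sketch of a reduced-footprint attention step: scores over the full key cache
    pick the top-n positions, and only those value vectors are fetched from slower
    storage (v_lookup is a hypothetical callable standing in for a host-memory store)."""
    scores = softmax(q @ k_cache.T / np.sqrt(q.shape[-1]))    # (seq_len,)
    top_idx = np.argsort(scores)[-top_n:]                     # highest-scoring positions
    v_selected = v_lookup(top_idx)                            # fetch only the needed values
    weights = scores[top_idx] / scores[top_idx].sum()         # renormalize over kept positions
    return weights @ v_selected

if __name__ == "__main__":
    rng = np.random.default_rng(0)
    seq_len, d = 128, 64
    k_cache = rng.standard_normal((seq_len, d))   # keys kept in fast accelerator memory
    v_store = rng.standard_normal((seq_len, d))   # values held in (slower) host memory
    q = rng.standard_normal(d)
    out = selective_value_attention(q, k_cache, lambda idx: v_store[idx])
    print(out.shape)
```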
BASS is a system for batched speculative decoding of large language models that achieves lower latency, higher GPU utilization, and better accuracy than prior approaches.
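The core difficulty in batching speculative decoding is that each sequence in the batch may accept a different number of drafted tokens per step, leaving ragged sequence lengths. The toy sketch below, with hypothetical `draft_tokens` and `target_accepts` stand-ins for the draft and target models, shows that per-sequence draft-verify-accept loop; it illustrates the general pattern rather than BASS's implementation.

```python
import random

random.seed(0)
VOCAB = 100

def draft_tokens(seq, k):
    """Hypothetical small draft model: propose k speculative tokens for one sequence."""
    return [random.randrange(VOCAB) for _ in range(k)]

def target_accepts(seq, token):
    """Hypothetical target-model check: does the large model agree with this draft token?"""
    return random.random() < 0.7

def batched_speculative_step(batch, k=4):
    """One batched speculative-decoding step sketch: draft k tokens for every sequence,
    verify them (conceptually in one batched target forward pass), and let each sequence
    accept a different-length prefix, leaving ragged sequence lengths to track."""
    new_batch, accepted_counts = [], []
    for seq in batch:
        accepted = []
        for tok in draft_tokens(seq, k):
            if target_accepts(seq + accepted, tok):
                accepted.append(tok)
            else:
                break                          # stop at the first rejected draft token
        # On rejection (or exhaustion), the target model supplies one guaranteed token.
        accepted.append(random.randrange(VOCAB))
        new_batch.append(seq + accepted)
        accepted_counts.append(len(accepted))
    return new_batch, accepted_counts

if __name__ == "__main__":
    batch = [[1], [2], [3]]
    for _ in range(3):
        batch, counts = batched_speculative_step(batch)
        print("tokens accepted this step:", counts, "lengths:", [len(s) for s in batch])
```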