Core Concepts
Memory bandwidth is becoming the primary bottleneck in AI applications, particularly for serving large language models, necessitating a rethinking of model architectures and training/serving strategies.
Abstract
The paper discusses the shift of the performance bottleneck in AI applications from compute to memory bandwidth, highlighting the disparity between hardware FLOPS scaling and the much slower growth of DRAM and interconnect bandwidth. It emphasizes how memory limitations increasingly constrain AI workloads, especially serving large models. The analysis covers historical observations, recent trends, a case study on Transformers, efficient training algorithms, deployment challenges, and a rethinking of AI accelerator design.
Abstract:
Unprecedented surge in model size, driven by the availability of unsupervised training data.
Memory bandwidth emerges as the primary bottleneck over compute.
Analysis of encoder and decoder Transformer models' memory constraints.
Introduction:
Compute growth rate for Large Language Models (LLMs) at 750×/2yrs (see the quick calculation after this section).
Emerging challenge with memory and communication bottlenecks.
Historical observations on memory bandwidth limitations.
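A quick back-of-the-envelope sketch of the gap between demand and supply, using only the two rates cited in this summary (750×/2yrs LLM compute demand vs. 3.0×/2yrs peak hardware FLOPS):

```python
# Demand vs. supply per 2-year period, using the rates cited in this summary:
# LLM training compute demand grows ~750x every 2 years, while peak server
# hardware FLOPS grow only ~3.0x over the same period.
llm_compute_growth = 750.0  # x per 2 years (demand)
hw_flops_growth = 3.0       # x per 2 years (supply, single device)

shortfall = llm_compute_growth / hw_flops_growth
print(f"Every 2 years, demand outgrows a single device by ~{shortfall:.0f}x;")
print("the rest must come from parallelism, which stresses interconnect bandwidth.")
```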
Data Extraction:
"Over the past 20 years, peak server hardware FLOPS has been scaling at 3.0×/2yrs."
"DRAM and interconnect bandwidth have only scaled at 1.6 and 1.4 times every 2 years."
Quotations:
"Each is improving exponentially, but the exponent for microprocessors is substantially larger than that for DRAMs."
"No exponential can continue forever," [28]
Case Study:
Analyzing Transformer inference characteristics and bottlenecks.
Importance of considering Arithmetic Intensity for performance evaluation.
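A minimal sketch of an arithmetic-intensity calculation for per-token decoder inference (a matrix-vector product over fp16 weights); the layer sizes are illustrative assumptions, not figures from the source.

```python
# Arithmetic intensity = FLOPs performed / bytes moved from memory.
# Sketch for one weight matrix applied to a single token during
# autoregressive decoding (a matrix-vector product), fp16 weights.
# Sizes below are illustrative assumptions, not taken from the source.

d_model = 4096                 # hidden size (assumed)
d_ff = 4 * d_model             # FFN width (assumed)
bytes_per_elem = 2             # fp16

# GEMV: y = W @ x with W of shape (d_ff, d_model)
flops = 2 * d_ff * d_model                      # one multiply + one add per weight
bytes_moved = d_ff * d_model * bytes_per_elem   # weight traffic dominates at batch 1

arithmetic_intensity = flops / bytes_moved      # = 1 FLOP/byte for an fp16 GEMV
print(f"Arithmetic intensity: {arithmetic_intensity:.2f} FLOPs/byte")

# At ~1 FLOP/byte, per-token decoding sits far below the ridge point of modern
# accelerators (typically on the order of hundreds of FLOPs/byte), so the
# kernel is memory-bandwidth bound rather than compute bound.
```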
Efficient Deployment:
Solutions like quantization, pruning redundant parameters, or designing small language models to address deployment challenges.
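A minimal sketch of symmetric per-tensor int8 weight quantization, one of the deployment techniques listed above; the recipe is illustrative and not the source's specific method.

```python
import numpy as np

# Minimal sketch of symmetric per-tensor int8 weight quantization, illustrating
# why quantization eases the bandwidth bottleneck: int8 weights move 4x fewer
# bytes than fp32 (2x fewer than fp16). Illustrative only, not the paper's recipe.

def quantize_int8(w: np.ndarray):
    """Map fp32 weights to int8 with a single per-tensor scale."""
    max_abs = np.abs(w).max()
    scale = max_abs / 127.0 if max_abs > 0 else 1.0
    q = np.clip(np.round(w / scale), -127, 127).astype(np.int8)
    return q, scale

def dequantize_int8(q: np.ndarray, scale: float) -> np.ndarray:
    return q.astype(np.float32) * scale

w = np.random.randn(1024, 1024).astype(np.float32)
q, scale = quantize_int8(w)
w_hat = dequantize_int8(q, scale)

print(f"Bytes fp32: {w.nbytes:,}  bytes int8: {q.nbytes:,}")
print(f"Max abs quantization error: {np.abs(w - w_hat).max():.4f}")
```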
Rethinking AI Accelerators:
Challenges in increasing memory bandwidth alongside peak compute capability.
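A roofline-style sketch of why bandwidth must scale with peak compute; the accelerator figures are hypothetical, chosen only to show how the ridge point (the arithmetic intensity needed to become compute-bound) rises when compute grows faster than bandwidth.

```python
# Roofline "ridge point" = peak FLOPS / memory bandwidth: the arithmetic
# intensity a kernel needs before it becomes compute-bound instead of
# memory-bound. Accelerator figures below are hypothetical.

def ridge_point(peak_flops: float, mem_bw_bytes_per_s: float) -> float:
    return peak_flops / mem_bw_bytes_per_s

gen1 = ridge_point(peak_flops=100e12, mem_bw_bytes_per_s=1e12)   # 100 FLOPs/byte
gen2 = ridge_point(peak_flops=1000e12, mem_bw_bytes_per_s=3e12)  # ~333 FLOPs/byte

print(f"Gen1 ridge point: {gen1:.0f} FLOPs/byte")
print(f"Gen2 ridge point: {gen2:.0f} FLOPs/byte")
# Kernels whose arithmetic intensity falls below the ridge point (e.g. ~1
# FLOP/byte for per-token decoding) cannot exploit the added compute without
# a matching increase in memory bandwidth.
```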
Stats
Over the past 20 years, peak server hardware FLOPS has been scaling at 3.0×/2yrs.
DRAM BW: 100× / 20 yrs (1.6×/2yrs)
Interconnect BW: 30× / 20 yrs (1.4×/2yrs)