Core Concepts
Optimizing ZeRO for efficient large language model training.
Abstract
The article introduces AMSP, a system that optimizes ZeRO for scalable LLM training. AMSP provides flexible sharding strategies for model states and overlaps communication with computation to reduce collective overhead. Evaluations show significant improvements in Model FLOPs Utilization (MFU) over systems such as MiCS and ZeRO++. The article also discusses the challenges ZeRO faces in large-scale LLM training and how AMSP addresses them.
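To make the sharding idea concrete, below is a minimal, framework-agnostic sketch of ZeRO-style state sharding with a configurable shard-group size, the kind of flexibility the article attributes to AMSP. The function names and the `shard_group_size` parameter are illustrative assumptions, not AMSP's actual API.

```python
import numpy as np

def build_shard_groups(world_size, shard_group_size):
    """Split all ranks into sub-groups; each sub-group holds one full copy
    of the sharded states (illustrative layout, not AMSP's actual API)."""
    assert world_size % shard_group_size == 0
    return [list(range(g * shard_group_size, (g + 1) * shard_group_size))
            for g in range(world_size // shard_group_size)]

def local_shard(flat_states, shard_group_size, rank_in_group):
    """Return the slice of model/optimizer state owned by one rank.
    Smaller shard groups keep collectives among fewer devices but store a
    larger slice per device -- the trade-off a flexible sharding strategy tunes."""
    shard_len = -(-len(flat_states) // shard_group_size)  # ceiling division
    start = rank_in_group * shard_len
    return flat_states[start:start + shard_len]

if __name__ == "__main__":
    states = np.arange(16, dtype=np.float32)                      # toy "model" of 16 values
    print(build_shard_groups(world_size=8, shard_group_size=4))   # two groups of 4 ranks
    print(local_shard(states, shard_group_size=4, rank_in_group=1))  # values 4..7
```

In a real ZeRO implementation the shards live on separate devices and are reconstructed with all-gather before use; the sketch only shows the partitioning logic that a flexible strategy would tune.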
Stats
Evaluations demonstrate up to 52% Model FLOPs Utilization (MFU, defined below) when training a LLaMA-based model on 1024 GPUs.
AMSP improves training throughput by a factor of 1.4–12.7 on 1024 GPUs when training LLaMA-based models.
Compared to MiCS and ZeRO++, AMSP achieves higher MFU: 51%, 52%, and 42% when training LLaMA-7B, LLaMA-13B, and LLaMA-30B, respectively.
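For context, the MFU figures above follow the standard definition of Model FLOPs Utilization (the conventional formula, not quoted from the article):

```latex
\[
\mathrm{MFU} = \frac{\text{model FLOPs executed per second during training}}{\text{theoretical peak FLOPs per second of the hardware}}
\]
```

So 52% MFU on 1024 GPUs means the training run sustains roughly half of the cluster's peak arithmetic throughput.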