Core Concepts
COLLAGE is a novel low-precision training strategy that uses a multi-component float representation to perform operations accurately while accounting for numerical errors, enabling efficient training of large language models without compromising accuracy.
Abstract
The content discusses the challenges of training large language models (LLMs), which stem from intensive compute costs and limited hardware memory. A practical remedy is low-precision representation, but it can cause loss of numerical accuracy and unstable training.
The authors propose COLLAGE, a new approach that uses multi-component float (MCF) representations in low precision to perform operations accurately while accounting for numerical errors. COLLAGE is designed as a plugin that integrates easily with existing optimizers such as AdamW, making the optimizer precision-aware.
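The multi-component float idea can be illustrated with an error-free transformation: a low-precision addition is split into a rounded sum plus an explicit error term, so the bits lost to rounding are kept in a second low-precision component rather than in an FP32 master copy. The sketch below is illustrative only, not the paper's implementation; the function name fast_two_sum and the use of PyTorch bfloat16 tensors are assumptions.

```python
import torch

def fast_two_sum(a: torch.Tensor, b: torch.Tensor):
    """Illustrative error-free transformation (Fast2Sum): a + b ~= s + e.

    Assumes |a| >= |b| elementwise, as in the classic Fast2Sum algorithm.
    Both the rounded sum s and the recovered error e stay in low precision,
    so the bits lost to rounding are preserved without an FP32 master copy.
    """
    s = a + b          # rounded low-precision sum
    e = b - (s - a)    # rounding error recovered in low precision
    return s, e

# Example: in bfloat16 a small update vanishes when added to a large weight,
# but the error component keeps the lost information.
w = torch.tensor([1024.0], dtype=torch.bfloat16)
u = torch.tensor([0.5], dtype=torch.bfloat16)
s, e = fast_two_sum(w, u)
print(s, e)  # s stays at 1024 (update rounded away), e recovers the 0.5
```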
The key highlights of COLLAGE are:
It avoids the need for high-precision master weights and upcasting, achieving memory efficiency.
It introduces a novel metric called "effective descent quality" that measures the information lost when updates are applied during training, providing a better understanding of the impact of different precision strategies (see the sketch after this list).
COLLAGE matches or exceeds the performance of the state-of-the-art mixed-precision strategy with FP32 master weights, while achieving up to 3.7x speedup and 15-23% less memory usage.
COLLAGE trains accurate models using only low-precision storage, matching the quality of the FP32 master-weights counterpart.
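As a rough illustration of how such a precision-aware plugin might look, the sketch below wraps a weight update with Kahan-style error compensation, keeping a low-precision error buffer instead of FP32 master weights, and reports an illustrative "descent quality" ratio (the fraction of the intended update actually absorbed by the low-precision weights). The class name CompensatedUpdate, the simplified update rule, and this particular ratio are assumptions for exposition, not the paper's exact algorithm or metric.

```python
import torch

class CompensatedUpdate:
    """Illustrative Kahan-style compensated weight update in low precision.

    Keeps a bfloat16 error buffer per parameter rather than an FP32 master
    copy. This is a sketch under simplifying assumptions, not the paper's
    exact algorithm.
    """

    def __init__(self, param: torch.Tensor):
        self.error = torch.zeros_like(param)  # rounding error carried across steps

    def apply(self, param: torch.Tensor, update: torch.Tensor) -> float:
        # Fold the previously lost error back into this step's update.
        corrected = update + self.error
        new_param = param + corrected        # rounded in low precision
        applied = new_param - param          # what actually landed in the weights
        self.error = corrected - applied     # rounding error carried to the next step
        param.copy_(new_param)

        # Illustrative "descent quality": fraction of the intended update
        # that was actually absorbed by the low-precision weights.
        intended = corrected.float().norm()
        return (applied.float().norm() / intended).item() if intended > 0 else 1.0

# Usage sketch: apply an externally computed optimizer update (e.g., -lr times
# an AdamW direction) to a bfloat16 weight.
w = torch.full((4,), 1024.0, dtype=torch.bfloat16)
step = CompensatedUpdate(w)
u = torch.full((4,), -0.5, dtype=torch.bfloat16)
quality = step.apply(w, u)
print(w, step.error, f"descent quality = {quality:.2f}")
```

In this toy example the first update is entirely rounded away (quality 0.0), but the error buffer retains it, so repeated steps eventually move the weight; without compensation the information would be lost outright.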
The authors evaluate COLLAGE on pretraining various LLM architectures, including BERT, RoBERTa, GPT, and OpenLLaMA, demonstrating its effectiveness in improving training efficiency while maintaining model quality.
Stats
COLLAGE achieves up to 3.7x speedup in training throughput compared to the mixed-precision strategy with FP32 master weights.
COLLAGE reduces peak GPU memory usage by 15-23% on average compared to the mixed-precision strategy with FP32 master weights.
COLLAGE matches or exceeds the performance of the mixed-precision strategy with FP32 master weights on various pretraining and finetuning tasks.
Quotes
"COLLAGE offers wall-clock time speedups by storing all variables in low-precision without upcasting."
"COLLAGE trains accurate models using only low-precision storage compared with FP32 master-weights counterpart."