Core Concepts
A systematic training method, ScaleFold, that incorporates optimizations addressing the key factors preventing AlphaFold training from scaling to more compute resources, enabling the initial training to be completed in 10 hours.
Abstract
The authors conducted a comprehensive analysis on the AlphaFold training procedure and identified that inefficient communications and overhead-dominated computations were the key factors preventing the training from scaling effectively.
To address these challenges, the authors introduced ScaleFold, a systematic training method that incorporated the following optimizations:
- Employed FastFold's Dynamic Axial Parallelism (DAP) to scale training across a larger number of GPUs.
- Implemented a Non-Blocking Data Pipeline so that highly unequal access times to training batches no longer stall the training loop (a prefetching sketch follows this list).
- Enabled CUDA Graphs to eliminate CPU launch overhead (a capture-and-replay sketch follows this list).
- Implemented efficient Triton kernels for critical patterns such as Multi-Head Attention, LayerNorm, and FusedAdam combined with Stochastic Weight Averaging (a LayerNorm kernel sketch follows this list).
- Applied PyTorch's compiler to automatically fuse memory-bound operations (a torch.compile sketch follows this list).
- Batched GEMMs and eliminated the overhead of gradient clipping (a batched-GEMM sketch follows this list).
- Enabled bfloat16 training (an autocast sketch follows this list).
- Implemented asynchronous evaluation to free training nodes from evaluation.
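The prefetcher below is a minimal sketch of the non-blocking data pipeline idea: a background thread keeps a small queue of ready batches so that one slow-loading sample cannot stall the training step. The class name `PrefetchingLoader` and the buffer depth are illustrative assumptions, not ScaleFold's actual implementation.

```python
import queue
import threading

class PrefetchingLoader:
    """Wraps any iterable of batches and prepares them in the background (illustrative sketch)."""

    def __init__(self, loader, depth=4):
        self.loader = loader                       # e.g. a torch DataLoader
        self.buffer = queue.Queue(maxsize=depth)   # bounded queue of ready batches
        self.worker = threading.Thread(target=self._fill, daemon=True)
        self.worker.start()

    def _fill(self):
        for batch in self.loader:
            self.buffer.put(batch)                 # blocks only when the buffer is full
        self.buffer.put(None)                      # sentinel: end of data

    def __iter__(self):
        while (batch := self.buffer.get()) is not None:
            yield batch
```

Because batches are pulled from a bounded buffer, occasional slow samples are absorbed by the queue instead of delaying every training iteration.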
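The following is a hedged sketch of CUDA Graph capture and replay in PyTorch, using a toy model in place of AlphaFold; the warm-up, capture, and replay structure follows PyTorch's `torch.cuda.graph` API, and all tensor shapes are illustrative assumptions.

```python
import torch

model = torch.nn.Linear(256, 256).cuda()
opt = torch.optim.SGD(model.parameters(), lr=1e-3)
static_x = torch.randn(32, 256, device="cuda")     # graphs require static shapes and addresses

# Warm up on a side stream so allocations settle before capture.
s = torch.cuda.Stream()
s.wait_stream(torch.cuda.current_stream())
with torch.cuda.stream(s):
    for _ in range(3):
        opt.zero_grad(set_to_none=True)
        model(static_x).sum().backward()
        opt.step()
torch.cuda.current_stream().wait_stream(s)

# Capture one full training iteration into a graph.
g = torch.cuda.CUDAGraph()
opt.zero_grad(set_to_none=True)
with torch.cuda.graph(g):
    static_loss = model(static_x).sum()
    static_loss.backward()
    opt.step()

# Training loop: copy new data into the captured input tensor, then replay.
for _ in range(10):
    static_x.copy_(torch.randn(32, 256, device="cuda"))
    g.replay()   # relaunches all captured kernels without per-op CPU overhead
```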
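As an example of the kind of Triton kernel involved, here is a minimal LayerNorm forward kernel in the style of the public Triton tutorials; it assumes each row fits in a single block and is not ScaleFold's actual kernel (which also covers attention and the fused optimizer).

```python
import torch
import triton
import triton.language as tl

@triton.jit
def layernorm_fwd(X, Y, W, B, stride, N, eps, BLOCK_SIZE: tl.constexpr):
    # One program instance normalizes one row of the (M, N) input.
    row = tl.program_id(0)
    cols = tl.arange(0, BLOCK_SIZE)
    mask = cols < N
    x = tl.load(X + row * stride + cols, mask=mask, other=0.0).to(tl.float32)
    mean = tl.sum(x, axis=0) / N
    var = tl.sum(tl.where(mask, (x - mean) * (x - mean), 0.0), axis=0) / N
    rstd = 1.0 / tl.sqrt(var + eps)
    w = tl.load(W + cols, mask=mask, other=1.0)
    b = tl.load(B + cols, mask=mask, other=0.0)
    y = (x - mean) * rstd * w + b
    tl.store(Y + row * stride + cols, y, mask=mask)

def layernorm(x, weight, bias, eps=1e-5):
    M, N = x.shape
    y = torch.empty_like(x)
    BLOCK_SIZE = triton.next_power_of_2(N)   # assumes one row fits in a single block
    layernorm_fwd[(M,)](x, y, weight, bias, x.stride(0), N, eps, BLOCK_SIZE=BLOCK_SIZE)
    return y
```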
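A small sketch of compiler-driven fusion: `torch.compile` (PyTorch's compiler entry point) can fuse chains of memory-bound elementwise operations into fewer kernels. The `gated_update` function is a made-up stand-in for an AlphaFold-style gated update, not code from the paper.

```python
import torch

@torch.compile
def gated_update(pair, gate_w, gate_b, update):
    # The sigmoid, multiply, and add are memory-bound; the compiler can
    # fuse them into a single kernel following the GEMM.
    gate = torch.sigmoid(pair @ gate_w + gate_b)
    return pair + gate * update

pair = torch.randn(64, 64, 128, device="cuda")
update = torch.randn(64, 64, 128, device="cuda")
gate_w = torch.randn(128, 128, device="cuda")
gate_b = torch.randn(128, device="cuda")
out = gated_update(pair, gate_w, gate_b, update)
```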
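GEMM batching can be illustrated with `torch.bmm`: many small matrix multiplications are issued as one batched call, amortizing kernel-launch overhead. The shapes below are illustrative only.

```python
import torch

a = torch.randn(1024, 32, 64, device="cuda")   # 1024 small GEMM problems
b = torch.randn(1024, 64, 48, device="cuda")

# One batched GEMM launch instead of 1024 separate matmul launches.
c = torch.bmm(a, b)                            # result shape: (1024, 32, 48)
```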
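Finally, a minimal bfloat16 training step using `torch.autocast`; because bfloat16 keeps float32's exponent range, no loss scaling is needed, unlike float16. The toy model and shapes are assumptions for illustration.

```python
import torch

model = torch.nn.Linear(512, 512).cuda()
opt = torch.optim.Adam(model.parameters(), lr=1e-4)
x = torch.randn(64, 512, device="cuda")

with torch.autocast(device_type="cuda", dtype=torch.bfloat16):
    loss = model(x).float().pow(2).mean()   # forward runs in bf16, loss reduced in fp32
loss.backward()
opt.step()
opt.zero_grad(set_to_none=True)
```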
With these optimizations, the authors demonstrated the scalability and efficiency of ScaleFold. In the MLPerf HPC V3.0 OpenFold benchmark, ScaleFold's training time to convergence was reduced to 7.51 minutes on 2080 NVIDIA H100 GPUs, a 6x speedup over the reference model. Furthermore, the authors trained the AlphaFold model from scratch with ScaleFold, reducing the initial training time from 7 days to 10 hours and setting a new record compared to prior works.
Stats
ScaleFold achieved a 6x speedup over the reference model in the MLPerf HPC V3.0 OpenFold benchmark, reducing the training time to convergence to 7.51 minutes on 2080 NVIDIA H100 GPUs.
ScaleFold reduced the initial training time for the AlphaFold model from 7 days to 10 hours.
Quotes
"ScaleFold successfully addressed the scalability issue and scaled the AlphaFold training to 2080 NVIDIA H100 GPUs, whereas prior arts only scaled up to 512."
"In the MLPef HPC v3.0 benchmark, ScaledFold finished the OpenFold partial training task in 7.51 minutes, over 6× faster than the benchmark baseline."
"For training the AlphaFold model from scratch, ScaleFold finished the pretraining (i.e., initial training phase) in 10 hours, set a new record compared to prior works."