Core Concepts
LIBRA is a workload-aware, design-time framework that optimizes the bandwidth distribution across multiple network dimensions to maximize training performance or performance-per-cost for distributed training of large AI models.
Abstract
This paper examines multi-dimensional network topologies as a cost-efficient mechanism for increasing overall network bandwidth in distributed training of large AI models, and introduces LIBRA, a framework for optimizing multi-dimensional fabric architectures.
Key highlights:
As AI models continue to scale, distributed training is necessary to accommodate model weights within each device and to reduce training time. However, the resulting exchange of gradients and activations adds communication overhead that becomes a critical bottleneck in end-to-end training.
Multi-dimensional network topologies, composed of multiple network technologies, can provide higher aggregate bandwidth per NPU and better performance-per-cost compared to traditional 2D networks.
LIBRA is a workload-aware, design-time optimization framework that determines the bandwidth distribution across network dimensions that maximizes training performance or performance-per-cost, while adhering to design constraints.
To drive this search, LIBRA models collective communications, distributed training, and network costs, estimating end-to-end training time and network cost for each candidate design; these estimates are then used to optimize the network (see the sketch after this list).
Case studies demonstrate that LIBRA-optimized networks can achieve up to 2.0x speedup and 13.0x performance-per-cost improvement over a baseline equal-bandwidth network configuration.
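To make the optimization concrete, below is a minimal Python sketch of a LIBRA-style design-time search: it sweeps per-dimension bandwidth splits under a fixed total per-NPU bandwidth budget, estimates iteration time with a textbook hierarchical ring all-reduce model, and keeps the split with the best performance-per-cost. All constants (NPU counts per dimension, cost coefficients, compute time, collective size) are hypothetical illustrations, and the analytical model is a simplification of LIBRA's actual collective, training, and cost models.

```python
from itertools import product

# Hypothetical example values; the real LIBRA framework derives these from
# workload characteristics and fabric cost models.
NPUS_PER_DIM = [8, 4, 16]        # NPU count along each network dimension
COST_PER_GBPS = [1.0, 2.0, 4.0]  # relative cost per GB/s of BW per dimension
TOTAL_BW = 500.0                 # total per-NPU BW budget (GB/s), assumed
COLLECTIVE_SIZE = 350.0          # gradient all-reduce size per NPU (GB), assumed
COMPUTE_TIME = 1.0               # per-iteration compute time (s), assumed

def allreduce_time(bw_per_dim):
    """Analytical time for a hierarchical ring all-reduce.

    Each dimension d moves roughly 2*(N_d - 1)/N_d of its chunk over BW_d,
    and the chunk shrinks by the dimension size as the collective descends.
    This is a textbook ring model, not LIBRA's exact simulator, and it
    ignores compute/communication overlap for simplicity.
    """
    time, chunk = 0.0, COLLECTIVE_SIZE
    for n, bw in zip(NPUS_PER_DIM, bw_per_dim):
        time += 2.0 * (n - 1) / n * chunk / bw
        chunk /= n
    return time

def network_cost(bw_per_dim):
    """Relative network cost: cost coefficient times allocated BW, per dim."""
    return sum(c * bw for c, bw in zip(COST_PER_GBPS, bw_per_dim))

best = None
# Exhaustively sweep BW splits in 25 GB/s steps; LIBRA's actual search is
# smarter, but enumeration illustrates the design-time idea.
step = 25.0
points = [step * i for i in range(1, int(TOTAL_BW / step))]
for bw0, bw1 in product(points, points):
    bw2 = TOTAL_BW - bw0 - bw1
    if bw2 <= 0:
        continue
    bws = (bw0, bw1, bw2)
    iter_time = COMPUTE_TIME + allreduce_time(bws)
    perf_per_cost = (1.0 / iter_time) / network_cost(bws)
    if best is None or perf_per_cost > best[0]:
        best = (perf_per_cost, bws, iter_time)

print(f"best BW split: {best[1]}, iteration time: {best[2]:.3f}s")
```

Switching the objective from performance-per-cost to raw performance amounts to maximizing 1.0 / iter_time instead, mirroring the two optimization targets described above.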
Stats
The total communication size for large model training can span GBs to TBs.
MSFT-1T model has 1 trillion parameters.
GPT-3 model has 175 billion parameters.
Turing-NLG model has 17 billion parameters.
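The model sizes above make the GB-to-TB communication range easy to verify with back-of-the-envelope arithmetic. The sketch below assumes 2-byte (FP16) gradients per parameter for a single gradient all-reduce; the precision is an assumption for illustration, not a figure from the paper.

```python
# Back-of-the-envelope gradient-exchange sizes, assuming 2-byte (FP16)
# gradients; precision is an assumption, not a figure from the paper.
BYTES_PER_GRAD = 2
models = {
    "Turing-NLG": 17e9,   # 17 billion parameters
    "GPT-3": 175e9,       # 175 billion parameters
    "MSFT-1T": 1e12,      # 1 trillion parameters
}
for name, params in models.items():
    gb = params * BYTES_PER_GRAD / 1e9
    print(f"{name}: ~{gb:,.0f} GB of gradients per all-reduce")
# Turing-NLG: ~34 GB; GPT-3: ~350 GB; MSFT-1T: ~2,000 GB (2 TB) --
# spanning GBs to TBs, consistent with the stat above.
```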
Quotes
"As model sizes in machine learning continue to scale, distributed training is necessary to accommodate model weights within each device and to reduce training time. However, this comes with the expense of increased communication overhead due to the exchange of gradients and activations, which become the critical bottleneck of the end-to-end training process."
"We believe that a promising approach to enhance the BW per NPU is to (i) explicitly add more network dimensions, and (ii) leverage a mixture of fabric technologies."