Optimizing Multi-dimensional Network Bandwidth for Efficient Distributed Training of Large AI Models
LIBRA is a workload-aware, design-time framework that optimizes how bandwidth is distributed across the dimensions of a multi-dimensional network, maximizing either training performance or performance-per-cost for distributed training of large AI models.
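
To make the idea of distributing bandwidth across network dimensions concrete, here is a minimal Python sketch of a toy allocation problem. It is not LIBRA's formulation: the cost model, the function names (`allocate_bandwidth`, `comm_time`), and all numbers are assumptions chosen for illustration. The sketch assumes each dimension carries a known traffic volume, that dimensions are used one after another so communication time is the sum of per-dimension transfer times, and that bandwidth in each dimension has a fixed unit cost under a total cost budget. Under those assumptions, Lagrange multipliers give a closed-form split in which each dimension's bandwidth is proportional to the square root of its traffic divided by its unit cost.

```python
import math


def allocate_bandwidth(traffic, unit_cost, budget):
    """Split a cost budget across network dimensions (toy model).

    Assumptions (illustrative only, not taken from LIBRA):
      - dimension i moves traffic[i] units of data at bandwidth b[i]
      - dimensions are used sequentially: time = sum(traffic[i] / b[i])
      - bandwidth in dimension i costs unit_cost[i] per unit
      - the whole budget is spent: sum(unit_cost[i] * b[i]) == budget
    Minimizing time under the budget gives
      b[i] proportional to sqrt(traffic[i] / unit_cost[i]).
    """
    weights = [math.sqrt(v / c) for v, c in zip(traffic, unit_cost)]
    # Scale the weights so the allocation spends exactly the budget.
    scale = budget / sum(c * w for c, w in zip(unit_cost, weights))
    return [scale * w for w in weights]


def comm_time(traffic, bandwidth):
    """Total transfer time when dimensions are used one after another."""
    return sum(v / b for v, b in zip(traffic, bandwidth))


# Hypothetical 3-dimensional network with uneven traffic per dimension.
traffic = [9.0, 4.0, 1.0]
unit_cost = [1.0, 1.0, 1.0]
budget = 60.0

tuned = allocate_bandwidth(traffic, unit_cost, budget)
equal = [budget / len(traffic)] * len(traffic)

print("tuned allocation:", tuned, "time:", comm_time(traffic, tuned))
print("equal allocation:", equal, "time:", comm_time(traffic, equal))
```

In this toy setting the workload-aware split yields a lower communication time than an equal split at the same cost, which is the kind of trade-off a design-time bandwidth optimizer is meant to exploit. If the dimensions were instead fully overlapped, the time would be the maximum rather than the sum of the per-dimension terms, and the optimal split would change accordingly.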