Core Concepts
This paper introduces a bandwidth-optimal Allgather algorithm for distributed AI that leverages hardware multicast and SmartNIC offloading to relieve the bandwidth bottleneck of collective operations in training pipelines.
Khalilov, M., Di Girolamo, S., Chrapek, M., Nudelman, R., Bloch, G., & Hoefler, T. (2024). Network-Offloaded Bandwidth-Optimal Broadcast and Allgather for Distributed AI. arXiv preprint arXiv:2408.13356.
The motivating problem is the communication cost of collectives in distributed AI, particularly in Fully Sharded Data Parallel (FSDP) training, where every worker must Allgather the sharded model parameters before each forward and backward pass, so the efficiency of this single collective largely determines end-to-end throughput.
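For context, here is a minimal sketch of the Allgather semantics being optimized, written against PyTorch's stock torch.distributed API. This is a host-driven baseline for illustration only, not the paper's multicast/SmartNIC implementation; the gloo backend, tensor shapes, and launch command are illustrative assumptions.

```python
import torch
import torch.distributed as dist

def allgather_shards(shard: torch.Tensor) -> torch.Tensor:
    """Collect every rank's shard and return their concatenation.

    This is the collective the paper targets: in FSDP, each rank holds
    a shard of the parameters and must reassemble the full tensor
    before compute. Here it runs over the stock host-driven transport;
    the paper instead moves the data movement into hardware multicast
    and SmartNIC offload.
    """
    world_size = dist.get_world_size()
    gathered = [torch.empty_like(shard) for _ in range(world_size)]
    dist.all_gather(gathered, shard)  # every rank receives all shards
    return torch.cat(gathered)

if __name__ == "__main__":
    # Illustrative launch: `torchrun --nproc_per_node=4 allgather_demo.py`;
    # torchrun sets the env vars that init_process_group reads.
    dist.init_process_group(backend="gloo")
    rank = dist.get_rank()
    shard = torch.full((4,), float(rank))  # this rank's stand-in parameter shard
    print(f"rank {rank}: {allgather_shards(shard).tolist()}")
    dist.destroy_process_group()
```

In this baseline, each rank both sends its shard to and receives shards from every peer, which is what makes a bandwidth-efficient transport underneath the collective so consequential at scale.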