Core Concepts
Fusing computation and collective communication within the same GPU kernel to effectively overlap communication with dependent computation, reducing overall execution time.
Abstract
The paper proposes a novel approach that fuses computation and collective communication within the same GPU kernel, enabling fine-grained overlap of communication with dependent computation. This is achieved by leveraging GPU-initiated intra-kernel networking capabilities, where GPU threads directly initiate network transactions without involving the host CPU.
The key highlights of the approach are:
Fused Computation-Communication Kernels:
Developed three prototype fused operators, embedding + All-to-All, GEMV + AllReduce, and GEMM + All-to-All, to address communication bottlenecks in DLRM, Transformer, and Mixture of Experts (MoE) models, respectively.
The fused kernels perform computation and communication concurrently, with GPU threads issuing non-blocking network transactions as soon as their share of computation is complete.
For scale-up communication, the fused kernels use zero-copy optimizations where the computed results are directly written to the peer GPU memory, eliminating intermediate buffering.
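The paper's fusion happens inside a single GPU kernel, with threads issuing GPU-initiated network transactions, which cannot be reproduced with stock PyTorch. As a rough, hypothetical illustration of the overlap idea only, the sketch below approximates a fused GEMV + AllReduce at chunk granularity using non-blocking collectives; the function name and chunking scheme are illustrative, not the paper's API.

```python
# Hypothetical sketch (not the paper's API): approximate a fused GEMV + AllReduce
# at chunk granularity with stock PyTorch. Communication for early output chunks
# overlaps with computation of later chunks; the paper's fused kernels go further
# by issuing the network transactions from GPU threads inside the kernel itself.
import torch
import torch.distributed as dist

def chunked_gemv_allreduce(weight: torch.Tensor, x: torch.Tensor, num_chunks: int = 4):
    # Assumes dist.init_process_group(backend="nccl") has been called and
    # weight.shape[0] is divisible by num_chunks.
    rows_per_chunk = weight.shape[0] // num_chunks
    y = torch.empty(weight.shape[0], device=x.device, dtype=x.dtype)
    handles = []
    for c in range(num_chunks):
        lo, hi = c * rows_per_chunk, (c + 1) * rows_per_chunk
        y[lo:hi] = weight[lo:hi] @ x                                # partial GEMV for this chunk
        handles.append(dist.all_reduce(y[lo:hi], async_op=True))   # non-blocking reduce of the ready chunk
    for h in handles:                                               # wait only once all chunks are issued
        h.wait()
    return y
```

Waiting on all handles only after the last chunk is issued is what exposes the overlap; a blocking all_reduce per chunk would serialize communication behind computation.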
Integration with ML Frameworks:
Exposed the fused operators as new PyTorch operators for transparent use by developers.
Extended the Triton framework to include communication primitives, enabling users to develop custom fused kernels.
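The summary does not give the actual operator names or the Triton communication primitives, so the following is only a hypothetical sketch of how such a fused operator could be surfaced as a PyTorch operator via torch.library. The namespace, operator name, and fallback body are made up for illustration; a real registration would dispatch to the fused GPU kernel rather than to separate GEMV and AllReduce calls, and the Triton side is not shown because stock Triton lacks communication primitives.

```python
# Hypothetical sketch: exposing a fused operator to PyTorch via torch.library.
# The namespace "fused_ccl" and op name "gemv_all_reduce" are illustrative only;
# the body below is a reference fallback, not the paper's fused kernel.
import torch
import torch.distributed as dist
from torch.library import Library

_lib = Library("fused_ccl", "DEF")
_lib.define("gemv_all_reduce(Tensor weight, Tensor x) -> Tensor")

def _gemv_all_reduce(weight: torch.Tensor, x: torch.Tensor) -> torch.Tensor:
    y = weight @ x        # local GEMV; a real backend would launch the fused kernel here
    dist.all_reduce(y)    # stand-in for the communication folded into that kernel
    return y

_lib.impl("gemv_all_reduce", _gemv_all_reduce, "CompositeExplicitAutograd")

# After dist.init_process_group(...), it is callable like any built-in operator:
#   y = torch.ops.fused_ccl.gemv_all_reduce(weight, x)
```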
Evaluation:
Scale-up evaluation showed the fused embedding + All-to-All, GEMV + AllReduce, and GEMM + All-to-All operators achieve 20-32%, 13-22%, and 12-20% lower execution time, respectively, compared to the baseline.
Scale-out evaluation on a 128-node DLRM system demonstrated a 21% reduction in overall execution time using the fused embedding + All-to-All operator.
Profiling analysis highlighted the effectiveness of communication-aware workgroup (WG) scheduling in reducing execution-time skew across nodes.
The proposed approach provides a practical solution to hide collective communication latency in distributed ML models by fusing it with dependent computation, without requiring any hardware changes.
Stats
Machine learning models have increased in size by five orders of magnitude between 2018 and 2022.
The All-to-All collective operation contributes up to 35% of the overall latency in state-of-the-art DLRM systems.
The AllReduce collective operation contributes up to 46% of the inference latency in Transformer models.
The All-to-All collective operations contribute up to 43% of the execution time in Mixture of Experts (MoE) models.
Quotes
"In order to satisfy their ever increasing capacity and compute requirements, machine learning models are distributed across multiple nodes using numerous parallelism strategies."
"As a result, collective communications are often on the critical path, and hiding their latency by overlapping kernel-granular communication and computation is difficult due to the absence of independent computation."
"Our evaluations show that our approach can effectively overlap communication with computations, subsequently reducing their combined execution time than the current collective library-based approaches."