
Overlapping Collective Communication with Dependent Computation in Distributed Machine Learning using GPU-Initiated Intra-Kernel Networking


Core Concepts
Fusing computation and collective communication within the same GPU kernel to effectively overlap communication with dependent computation, reducing overall execution time.
Abstract
The paper proposes a novel approach that fuses computation and collective communication within the same GPU kernel, enabling fine-grain overlapping of communication with dependent computation. This is achieved by leveraging GPU-initiated intra-kernel networking, where GPU threads directly initiate network transactions without relying on the host CPU. The key highlights of the approach are:

Fused computation-communication kernels: Three prototype fused operators are developed, embedding + All-to-All, GEMV + AllReduce, and GEMM + All-to-All, to address communication bottlenecks in DLRM, Transformer, and Mixture of Experts (MoE) models. The fused kernels perform computation and communication concurrently, with GPU threads issuing non-blocking network transactions as soon as their share of the computation is complete. For scale-up communication, the fused kernels use zero-copy optimizations in which computed results are written directly to peer GPU memory, eliminating intermediate buffering.

Integration with ML frameworks: The fused operators are exposed as new PyTorch operators for transparent use by developers, and the Triton framework is extended with communication primitives so that users can develop custom fused kernels.

Evaluation: In scale-up experiments, the fused embedding + All-to-All, GEMV + AllReduce, and GEMM + All-to-All operators achieve 20-32%, 13-22%, and 12-20% lower execution time, respectively, than the baseline. A scale-out evaluation on a 128-node DLRM system demonstrates a 21% reduction in overall execution time using the fused embedding + All-to-All operator, and profiling analysis highlights the effectiveness of communication-aware work-group (WG) scheduling in reducing execution-time skew across nodes.

The proposed approach provides a practical way to hide collective communication latency in distributed ML models by fusing it with dependent computation, without requiring any hardware changes.
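To make the fused pattern concrete, here is a minimal sketch of a compute + All-to-All fusion, assuming an NVSHMEM-style device-initiated networking API as a stand-in for the paper's GPU-initiated intra-kernel networking runtime. The kernel name, the toy computation, and the buffer layout are illustrative assumptions, not the paper's actual implementation.

```cuda
// Minimal sketch of a fused "compute + All-to-All" kernel using NVSHMEM's
// device-initiated, non-blocking puts as a stand-in for the paper's
// GPU-initiated intra-kernel networking. Names, the toy computation, and
// the chunk layout are illustrative assumptions.
#include <nvshmem.h>
#include <nvshmemx.h>

__global__ void fused_compute_all_to_all(const float *in,      // local input: n_pes * chunk elems
                                         float       *scratch, // local staging: n_pes * chunk elems
                                         float       *out,     // symmetric (NVSHMEM) buffer: n_pes * chunk elems
                                         int          chunk)
{
    const int my_pe = nvshmem_my_pe();
    const int n_pes = nvshmem_n_pes();

    // Each thread block owns the chunk(s) destined for one or more peer PEs.
    for (int dst = blockIdx.x; dst < n_pes; dst += gridDim.x) {
        // Stand-in "computation": a real fused operator would compute the
        // embedding-bag / GEMM tile assigned to this block here.
        for (int i = threadIdx.x; i < chunk; i += blockDim.x)
            scratch[dst * chunk + i] = 2.0f * in[dst * chunk + i];
        __syncthreads();

        // As soon as this block's share of the computation is done, push it
        // to the peer with a non-blocking, block-scoped put. Communication
        // for this chunk now overlaps with chunks other blocks still compute.
        nvshmemx_float_put_nbi_block(out + my_pe * chunk,
                                     scratch + dst * chunk,
                                     chunk, dst);
    }

    // Flush this thread's outstanding puts before the kernel exits.
    nvshmem_quiet();
}
```

By contrast, a library-based baseline would launch the compute kernel, wait for it to finish, and only then invoke a host-initiated collective from a communication library, serializing the two phases; in the fused sketch, each chunk's transfer begins the moment its computation completes.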
Stats
Machine learning models have increased in size by five orders of magnitude between 2018 and 2022.
The All-to-All collective operation contributes up to 35% of the overall latency in state-of-the-art DLRM systems.
The AllReduce collective operation contributes up to 46% of the inference latency in Transformer models.
All-to-All collective operations contribute up to 43% of the execution time in Mixture of Experts (MoE) models.
Quotes
"In order to satisfy their ever increasing capacity and compute requirements, machine learning models are distributed across multiple nodes using numerous parallelism strategies." "As a result, collective communications are often on the critical path, and hiding their latency by overlapping kernel-granular communication and computation is difficult due to the absence of independent computation." "Our evaluations show that our approach can effectively overlap communication with computations, subsequently reducing their combined execution time than the current collective library-based approaches."

Deeper Inquiries

How can the proposed approach be extended to handle irregular communication patterns observed in graph neural networks?

The proposed approach can be extended to graph neural networks by incorporating dynamic communication scheduling. In graph neural networks, communication patterns are non-uniform and depend on the connectivity of the graph, so the fused computation-communication kernels must adapt to varying per-node communication requirements. This can involve specialized communication primitives that handle irregular data exchanges efficiently, together with scheduling algorithms that prioritize and order communication based on the graph's connectivity, minimizing communication overhead while maximizing overlap with computation. A hedged sketch of such a connectivity-driven fused exchange follows.
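The sketch below again uses NVSHMEM-style device-initiated puts as a stand-in for GPU-initiated networking, with hypothetical CSR-style partition metadata (remote_off, remote_pe, remote_slot) that a graph partitioner would produce. It shows how a fused GNN update could push each boundary vertex's result only to the peers that hold copies of it, as soon as that vertex is ready.

```cuda
// Hypothetical sketch of a fused GNN update + irregular halo exchange.
// The CSR-style metadata describing which peer partitions need each
// boundary vertex is an assumption for illustration.
#include <nvshmem.h>
#include <nvshmemx.h>

__global__ void fused_gnn_halo_exchange(const float *h_in,        // input features: n_local x dim
                                        float       *h_out,       // updated features: n_local x dim
                                        float       *halo_buf,    // symmetric recv buffer: slots x dim
                                        const int   *remote_off,  // per-vertex offsets (n_local + 1 entries)
                                        const int   *remote_pe,   // peer PE holding a copy of this vertex
                                        const int   *remote_slot, // slot to write on that peer
                                        int n_local, int dim)
{
    // One thread block per local vertex, strided over all local vertices.
    for (int v = blockIdx.x; v < n_local; v += gridDim.x) {
        // Stand-in "computation": a real kernel would aggregate neighbor
        // messages here; we simply copy the feature through.
        for (int i = threadIdx.x; i < dim; i += blockDim.x)
            h_out[v * dim + i] = h_in[v * dim + i];
        __syncthreads();

        // Irregular communication: push this vertex's update only to the
        // peers that actually need it, as soon as it is ready.
        for (int e = remote_off[v]; e < remote_off[v + 1]; ++e)
            nvshmemx_float_put_nbi_block(halo_buf + remote_slot[e] * dim,
                                         h_out + v * dim,
                                         dim, remote_pe[e]);
    }
    // Flush outstanding puts before the kernel exits.
    nvshmem_quiet();
}
```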

What are the potential challenges in automatically generating fused computation-communication kernels using compiler techniques?

Automatically generating fused computation-communication kernels using compiler techniques may face several challenges:

Dependency analysis: The compiler must accurately identify dependencies between computation and communication tasks so that communication is overlapped only with independent computation.

Optimization overheads: Fusion-oriented optimizations may add compilation time and complexity; the benefit of the generated kernels must be balanced against these costs.

Hardware abstraction: The compiler must be aware of the underlying hardware architecture to generate efficient fused kernels, and adapting the optimizations to different GPU architectures and communication protocols can be challenging.

Dynamic workloads: Real-world applications exhibit changing workloads and communication patterns, so fusion strategies may need to be adjusted at runtime.

Debugging and profiling: Compiler-generated fused kernels are harder to debug and profile, so proper tooling is essential for efficient development and optimization.

How can the proposed approach be adapted to handle dynamic changes in the communication patterns during the execution of distributed ML models?

To adapt the proposed approach to dynamic changes in communication patterns during the execution of distributed ML models, the following strategies can be employed (a hedged sketch of one adaptive kernel follows this list):

Dynamic scheduling: Adjust the communication-computation overlap at runtime by continuously monitoring communication requirements and reorganizing the workload accordingly.

Adaptive fusion: Dynamically fuse and de-fuse computation and communication tasks based on the changing communication patterns, so the kernels remain optimized for the current workload.

Feedback mechanisms: Collect information about communication performance at runtime and use it to adjust the fusion strategy and the degree of overlap.

Runtime optimization: Analyze communication patterns on the fly and make real-time decisions that maximize the overlap between computation and communication.
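One way such adaptivity could look inside a fused kernel is sketched below: the same kernel either pushes each piece of a chunk eagerly for fine-grained overlap or falls back to one bulk transfer per peer when pieces are too small to benefit. The eager/deferred decision, names, and layout are assumptions for illustration; in practice the flag would be set by a runtime feedback mechanism.

```cuda
// Hypothetical sketch of "adaptive fusion": eager per-piece puts versus one
// deferred bulk put per peer, selected at runtime. NVSHMEM is again a
// stand-in for the GPU-initiated networking layer.
#include <nvshmem.h>
#include <nvshmemx.h>

__global__ void adaptive_fused_all_to_all(const float *in, float *scratch,
                                          float *out, int chunk, int pieces,
                                          bool eager)
{
    const int my_pe = nvshmem_my_pe();
    const int n_pes = nvshmem_n_pes();

    // Buffers are laid out as [pe][piece][chunk].
    for (int dst = blockIdx.x; dst < n_pes; dst += gridDim.x) {
        for (int p = 0; p < pieces; ++p) {
            // Compute one piece of the chunk destined for PE `dst`
            // (toy computation standing in for the real operator).
            for (int i = threadIdx.x; i < chunk; i += blockDim.x)
                scratch[(dst * pieces + p) * chunk + i] =
                    2.0f * in[(dst * pieces + p) * chunk + i];
            __syncthreads();

            // Eager mode: overlap communication at piece granularity.
            if (eager)
                nvshmemx_float_put_nbi_block(
                    out + (my_pe * pieces + p) * chunk,
                    scratch + (dst * pieces + p) * chunk, chunk, dst);
        }
        // Deferred mode: one bulk put per peer once its chunk is complete.
        if (!eager)
            nvshmemx_float_put_nbi_block(out + my_pe * pieces * chunk,
                                         scratch + dst * pieces * chunk,
                                         (size_t)pieces * chunk, dst);
    }
    // Flush outstanding puts before the kernel exits.
    nvshmem_quiet();
}
```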