Core Concepts
The Holmes framework enables efficient distributed training of large language models across GPU clusters with heterogeneous network interface cards, outperforming existing frameworks in heterogeneous NIC environments.
Summary
The paper introduces the Holmes framework, which is designed to enable efficient distributed training of large language models (LLMs) across GPU clusters with heterogeneous network interface cards (NICs).
Key highlights:
- LLM training often requires extensive GPU resources (tens of thousands of GPUs) and can be very costly. Existing training frameworks focus on optimizing training within homogeneous GPU clusters with high-speed RDMA interconnects.
- The Holmes framework addresses the challenge of training LLMs in heterogeneous NIC environments, where GPU clusters may have different types of NICs (InfiniBand, RoCE, Ethernet) that are not compatible with each other.
- Holmes employs a novel scheduling method that intelligently allocates computational tasklets to GPU devices based on the characteristics of their connected NICs, maximizing training efficiency.
- Holmes also introduces cross-cluster pipeline parallelism and a self-adapting pipeline partition strategy to further optimize training in heterogeneous NIC environments (both ideas are sketched after this list).
- Comprehensive experiments show that Holmes consistently achieves performance close to that of homogeneous RDMA-capable networks and significantly exceeds the training efficiency achievable in pure Ethernet environments.
- Holmes seamlessly integrates with other mainstream LLM training frameworks like Megatron-LM and Megatron-DeepSpeed.
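
The paper summary above does not include code, but the two ideas from the list can be illustrated with a short Python sketch. Everything here is hypothetical: the names (`NicType`, `Node`, `group_by_nic`, `partition_layers`) and the relative-bandwidth weights are illustrative stand-ins, not Holmes's actual API or measured values.

```python
from dataclasses import dataclass
from enum import Enum

class NicType(Enum):
    INFINIBAND = "InfiniBand"
    ROCE = "RoCE"
    ETHERNET = "Ethernet"

# Illustrative relative bandwidth weights (assumed, not from the paper).
RELATIVE_BW = {NicType.INFINIBAND: 1.0, NicType.ROCE: 0.8, NicType.ETHERNET: 0.3}

@dataclass
class Node:
    name: str
    nic_type: NicType

def group_by_nic(nodes: list[Node]) -> dict[NicType, list[Node]]:
    """Group nodes into homogeneous-NIC 'islands': bandwidth-hungry
    collectives (tensor/data parallelism) stay inside a fast island,
    while only the lighter point-to-point pipeline traffic crosses
    island boundaries."""
    islands: dict[NicType, list[Node]] = {}
    for node in nodes:
        islands.setdefault(node.nic_type, []).append(node)
    # Fastest islands first: they would host communication-heavy tasklets.
    return dict(sorted(islands.items(), key=lambda kv: -RELATIVE_BW[kv[0]]))

def partition_layers(num_layers: int, stage_nics: list[NicType]) -> list[int]:
    """A hypothetical 'self-adapting' pipeline partition heuristic: give
    stages behind slower NICs fewer layers, so that each stage's combined
    compute and communication time stays roughly balanced. Returns the
    number of layers assigned to each pipeline stage."""
    weights = [RELATIVE_BW[nic] for nic in stage_nics]
    total = sum(weights)
    counts = [max(1, round(num_layers * w / total)) for w in weights]
    counts[-1] += num_layers - sum(counts)  # absorb rounding drift
    return counts
```

With these assumed weights, `partition_layers(24, [NicType.INFINIBAND, NicType.INFINIBAND, NicType.ETHERNET])` returns `[10, 10, 4]`: the Ethernet-connected stage receives fewer layers to compensate for its slower inter-stage link.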
Statistics
Training a GPT model with 3.6 billion parameters on 4 nodes, the TFLOPS achieved are 197 for InfiniBand, 160 for RoCE, and 122 for Ethernet.
Training the same GPT model on 8 nodes, the TFLOPS achieved are 148 for InfiniBand, 145 for RoCE, and 83 for Ethernet.
The time cost of the grads-reduce-scatter operation, a critical step in data parallelism, is significantly lower in the homogeneous InfiniBand environment than in the heterogeneous Ethernet environment (a sketch of the operation follows).
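
For context, a grads-reduce-scatter sums the gradient buffers across the data-parallel group while leaving each rank with only its own shard of the result. Below is a minimal PyTorch sketch of the operation, not Holmes's implementation; it assumes an already-initialized process group and a flattened gradient buffer whose length is divisible by the world size.

```python
import torch
import torch.distributed as dist

def grads_reduce_scatter(flat_grads: torch.Tensor) -> torch.Tensor:
    """Sum `flat_grads` across all data-parallel ranks and return this
    rank's 1/world_size shard of the summed buffer. The per-rank data
    volume makes this step bandwidth-bound, which is why its cost is so
    sensitive to the NIC type (InfiniBand vs. RoCE vs. Ethernet)."""
    world_size = dist.get_world_size()
    assert flat_grads.numel() % world_size == 0
    shard = torch.empty(
        flat_grads.numel() // world_size,
        dtype=flat_grads.dtype,
        device=flat_grads.device,
    )
    dist.reduce_scatter_tensor(shard, flat_grads, op=dist.ReduceOp.SUM)
    return shard
```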
Quotes
"Holmes consistently achieves performance levels close to those achievable with homogeneous RDMA-capable networks, significantly exceeding training efficiency within the pure Ethernet environment."
"Holmes seamlessly integrates with other mainstream LLM training frameworks such as Megatron-LM and Megatron-DeepSpeed."