Key Idea
Joint consideration of scheduling and adaptive parallelism can significantly improve training efficiency in heterogeneous GPU clusters.
Abstract
Integrating adaptive parallelism into cluster scheduling to optimize large model training is challenging, chiefly because it enlarges the scheduling space. Crius is a system that efficiently schedules multiple large models with adaptive parallelism in heterogeneous clusters. It introduces a novel scheduling granularity called Cell, which allows accurate performance estimation and efficient job scheduling. In experiments, Crius reduces job completion time by up to 48.9% and improves cluster throughput by up to 1.49×.
Outline:
Introduction
Challenges of integrating adaptive parallelism into cluster scheduling.
Crius System Overview
Introduction of Crius and its novel scheduling granularity, Cell.
Data Extraction Techniques
The exponentially enlarged scheduling space hinders performance data acquisition.
Experimental Results
Evaluation of Crius on a physical testbed with 64 GPUs.
Performance Analysis
Comparison of Crius with baselines on a real testbed.
Statistics
Experimental results show that Crius reduces job completion time by up to 48.9%.
Crius achieves up to 1.49× cluster throughput improvement on the real testbed.
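The two figures above use different units (a percentage reduction and a multiplicative factor), so it can help to put them on a common scale. The sketch below is plain arithmetic rather than anything from Crius itself; the function names are hypothetical. Both results are "up to" values for different metrics, so the converted numbers are for intuition only and are not directly comparable.

```python
def speedup_from_reduction(reduction_pct: float) -> float:
    """Convert a percentage reduction in time into an equivalent speedup factor."""
    if not 0 <= reduction_pct < 100:
        raise ValueError("reduction_pct must be in [0, 100)")
    return 1.0 / (1.0 - reduction_pct / 100.0)


def reduction_from_speedup(factor: float) -> float:
    """Convert a speedup factor into the equivalent percentage reduction in time."""
    if factor <= 0:
        raise ValueError("factor must be positive")
    return (1.0 - 1.0 / factor) * 100.0


# A 48.9% shorter job completion time means jobs finish about 1.96x faster.
print(round(speedup_from_reduction(48.9), 2))

# A 1.49x throughput gain means about 32.9% less time per unit of work.
print(round(reduction_from_speedup(1.49), 1))
```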
Quotes
"Integrating adaptive parallelism into a cluster scheduler expands the cluster scheduling space."
"Crius reduces job completion time by up to 48.9%."