Harnessing GPU Performance Variability to Improve Scheduling of Machine Learning Workloads in GPU Clusters
This work uses application-specific performance variability profiles together with PAL, a novel placement policy that co-optimizes for performance variability and network locality, to significantly improve job completion time, cluster utilization, and makespan for machine learning workloads in GPU clusters.
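
To make the idea of co-optimizing variability and locality concrete, here is a minimal illustrative sketch. It is not the PAL algorithm itself, whose details are not given in this summary. It rests on several assumptions: each GPU carries a hypothetical per-application slowdown factor (1.0 means nominal speed), a synchronous job runs at the pace of its slowest GPU, and every extra node a job spans adds a fixed multiplicative penalty. The names `place_job` and `locality_penalty`, and the cost formula, are invented for illustration.

```python
from itertools import combinations


def place_job(gpus, k, locality_penalty=0.1):
    """Pick k GPUs minimizing a toy cost that mixes variability and locality.

    gpus: list of (gpu_id, node_id, slowdown) tuples, where slowdown is a
          hypothetical profiled factor for this application (1.0 = nominal).
    locality_penalty: assumed multiplicative cost per extra node spanned.
    """
    best, best_cost = None, float("inf")
    # Brute force over all k-subsets; fine for a sketch, not for a real cluster.
    for combo in combinations(gpus, k):
        slowest = max(slowdown for _, _, slowdown in combo)
        nodes = {node for _, node, _ in combo}
        cost = slowest * (1.0 + locality_penalty * (len(nodes) - 1))
        if cost < best_cost:
            best, best_cost = combo, cost
    if best is None:
        raise ValueError("not enough GPUs for the requested job size")
    return [gpu_id for gpu_id, _, _ in best], best_cost


# Hypothetical cluster: two nodes with two GPUs each.
cluster = [
    ("a0", "A", 1.00),
    ("a1", "A", 1.30),
    ("b0", "B", 1.05),
    ("b1", "B", 1.08),
]

with_locality = place_job(cluster, 2)                            # stays on node B
variability_only = place_job(cluster, 2, locality_penalty=0.0)   # spans A and B
```

With the assumed penalty, the sketch keeps the job on node B even though the single fastest GPU sits on node A. With the penalty set to zero, it chooses the two fastest GPUs regardless of node. This is the kind of trade-off a variability-aware, locality-aware policy has to weigh.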