Core Concepts
Parcae proactively adapts DNN training on preemptible instances to anticipated preemptions, reducing cost and improving performance.
Abstract
Parcae is a system for proactive, cost-effective, and scalable DNN training on preemptible cloud instances. It introduces liveput, a metric for the expected training throughput of a job under possible preemption scenarios, and optimizes for liveput rather than raw throughput. To handle preemptions efficiently, Parcae uses three migration strategies: intra-stage, inter-stage, and pipeline migration. An availability predictor forecasts spot-instance availability, and the liveput optimizer uses these forecasts to adjust the parallel configuration dynamically.
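To make the liveput idea concrete, here is a minimal Python sketch. It is not Parcae's implementation: the throughput model, the `sync_overhead` constant, and the names `toy_throughput`, `liveput`, and `best_config` are all hypothetical, chosen only for illustration. The sketch assumes that after a preemption the surviving instances can be regrouped into complete pipelines of the same depth, and it ignores migration cost entirely.

```python
def toy_throughput(depth: int, num_pipelines: int, sync_overhead: float = 0.05) -> float:
    """Hypothetical throughput model: work scales with the instance count,
    discounted by a data-parallel synchronization penalty (an assumption)."""
    if num_pipelines <= 0:
        return 0.0
    return depth * num_pipelines / (1.0 + sync_overhead * (num_pipelines - 1))


def liveput(depth: int, num_pipelines: int, scenarios) -> float:
    """Expected throughput over preemption scenarios.

    `scenarios` is a list of (probability, surviving_instances) pairs.
    Assumes survivors can be regrouped into full pipelines of `depth` stages.
    """
    expected = 0.0
    for prob, surviving in scenarios:
        usable = min(num_pipelines, surviving // depth)
        expected += prob * toy_throughput(depth, usable)
    return expected


def best_config(num_instances: int, scenarios, depths=(1, 2, 4, 8)):
    """Pick the (depth, num_pipelines) pair with the highest liveput."""
    candidates = [(d, num_instances // d) for d in depths if d <= num_instances]
    return max(candidates, key=lambda c: liveput(c[0], c[1], scenarios))


if __name__ == "__main__":
    no_preemption = [(1.0, 8)]
    one_may_vanish = [(0.5, 8), (0.5, 7)]  # 50% chance of losing one instance
    print(best_config(8, no_preemption))
    print(best_config(8, one_may_vanish))
```

Under this toy model, with 8 instances and no preemption risk, the single 8-stage pipeline scores highest. Once there is a 50% chance of losing one instance, four 2-stage pipelines score higher, because a single lost instance no longer stalls the whole job. This is the kind of trade-off between throughput and robustness that a liveput-based optimizer is meant to capture.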
Stats
"Compared to existing reactive systems, Parcae outperforms by up to 10×."
"A single training run of GPT-3 costs $4.6 million on AWS."
"Parcae achieves near-optimal performance for large DNNs under frequent preemptions."
Quotes
"Deep neural networks are becoming progressively large and costly to train."
"Existing systems use a reactive approach that only achieves limited performance and scalability."
"Parcae's proactive solution considers both job throughput and robustness under preemptions."