Core Concepts
DeepVM recommends cost-effective cluster configurations for deep learning training by balancing Spot and On-Demand VMs.
Abstract
DeepVM optimizes deep learning clusters by intelligently combining Spot and On-Demand VMs. Spot VMs are cheaper but can be reclaimed by the provider, so training on them depends on checkpointing; DeepVM addresses the challenges this creates and gives users cost-effective alternatives. The system uses linear programming to identify optimal configurations tailored to user-specific needs, and it consistently outperforms other policies in both simulations and real-world deployments.
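To make the idea of choosing a Spot/On-Demand mix concrete, here is a minimal sketch of a cost-minimizing selection. It is not DeepVM's actual formulation: the prices, the per-VM throughput, the `SPOT_EFFICIENCY` discount for checkpoint and restart overhead, and the function name `cheapest_mix` are all hypothetical. The search is a plain brute-force enumeration over integer VM counts, standing in for a linear-programming solver to keep the example dependency-free.

```python
from itertools import product

# All numbers below are hypothetical, chosen only for illustration.
ON_DEMAND_PRICE = 3.00   # $/hour per On-Demand VM
SPOT_PRICE = 1.00        # $/hour per Spot VM
VM_THROUGHPUT = 10.0     # useful TFLOPS per VM when never interrupted
SPOT_EFFICIENCY = 0.8    # assumed share of Spot throughput left after checkpoint/restart overhead


def cheapest_mix(required_tflops, min_on_demand=1, max_vms=16):
    """Return (hourly_cost, n_on_demand, n_spot) for the cheapest feasible mix, or None."""
    best = None
    for n_od, n_spot in product(range(max_vms + 1), repeat=2):
        # Keep at least `min_on_demand` stable VMs and respect the cluster size cap.
        if n_od < min_on_demand or n_od + n_spot > max_vms:
            continue
        # Spot VMs contribute less effective throughput because of interruptions.
        effective = VM_THROUGHPUT * (n_od + SPOT_EFFICIENCY * n_spot)
        if effective < required_tflops:
            continue
        cost = ON_DEMAND_PRICE * n_od + SPOT_PRICE * n_spot
        candidate = (cost, n_od, n_spot)
        if best is None or candidate < best:
            best = candidate
    return best


if __name__ == "__main__":
    print(cheapest_mix(required_tflops=45))
```

With these assumed numbers the objective and constraints are all linear in the two VM counts, which is why a linear-programming formulation fits this kind of problem. The enumeration simply makes the trade-off visible: Spot VMs are cheaper per hour but deliver less effective throughput.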
Stats
DeepVM leverages the FLOPP metric to analyze instance performance.
Extensive simulations show that DeepVM reduces training costs more than the other policies it is compared against.
Real-world deployments on AWS show DeepVM's effectiveness in optimizing cluster configurations.
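The FLOPP-based instance analysis mentioned above can be illustrated with a small sketch. It assumes FLOPP measures floating-point throughput per unit price, which the summary does not spell out. The instance names, throughput figures, and prices are hypothetical and do not correspond to real AWS offerings.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class Instance:
    name: str
    tflops: float           # peak floating-point throughput (hypothetical)
    price_per_hour: float   # hourly price in dollars (hypothetical)


def flopp(instance):
    """Throughput per dollar-hour: the assumed meaning of FLOPP in this sketch."""
    if instance.price_per_hour <= 0:
        raise ValueError("price_per_hour must be positive")
    return instance.tflops / instance.price_per_hour


def rank_by_flopp(instances):
    """Sort instances from most to least throughput per dollar."""
    return sorted(instances, key=flopp, reverse=True)


# Hypothetical catalog, for illustration only.
CATALOG = [
    Instance("small-gpu", 8.0, 0.50),
    Instance("medium-gpu", 30.0, 3.00),
    Instance("large-gpu", 120.0, 10.00),
]

if __name__ == "__main__":
    for inst in rank_by_flopp(CATALOG):
        print(f"{inst.name}: {flopp(inst):.1f} TFLOPS per $/hour")
```

A price-normalized metric like this puts instance types of very different sizes on one scale, so the largest instance is not automatically the most cost-effective one.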