Minder: An Automated System for Detecting Faulty Machines in Large-Scale Distributed Model Training
Large-scale distributed model training is susceptible to frequent machine failures, which cause significant downtime and economic losses. Minder, an automated faulty-machine detection system, uses machine-level similarity and continuity patterns in monitoring metrics to identify faulty machines quickly and accurately, reducing both manual effort and downtime.
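To make the two patterns concrete, the following is a minimal sketch and not Minder's actual algorithm. It assumes that "similarity" means healthy machines in the same training task report similar values for a monitoring metric, and that "continuity" means a real fault keeps a machine abnormal across several consecutive time windows. The function names, the median-based distance, and the `threshold` and `min_consecutive` parameters are all hypothetical choices made for illustration.

```python
from statistics import median


def outlier_per_window(samples, threshold):
    """Similarity check for one time window (illustrative assumption).

    samples: dict mapping machine name -> metric value in this window.
    Returns the machine furthest from the cross-machine median if its
    deviation exceeds `threshold`, otherwise None.
    """
    center = median(samples.values())
    machine, value = max(samples.items(), key=lambda kv: abs(kv[1] - center))
    return machine if abs(value - center) > threshold else None


def detect_faulty(windows, threshold, min_consecutive):
    """Continuity check across windows (illustrative assumption).

    windows: iterable of per-window `samples` dicts, in time order.
    Reports a machine only if it is the outlier in `min_consecutive`
    consecutive windows, so a one-off spike is not reported.
    """
    streak_machine, streak = None, 0
    for samples in windows:
        candidate = outlier_per_window(samples, threshold)
        if candidate is not None and candidate == streak_machine:
            streak += 1
        else:
            # Start a new streak (or reset if this window has no outlier).
            streak_machine = candidate
            streak = 1 if candidate is not None else 0
        if streak_machine is not None and streak >= min_consecutive:
            return streak_machine
    return None


if __name__ == "__main__":
    # Hypothetical metric values (e.g. CPU usage in percent) for four machines.
    windows = [
        {"m0": 50, "m1": 52, "m2": 51, "m3": 49},
        {"m0": 50, "m1": 51, "m2": 95, "m3": 50},
        {"m0": 51, "m1": 50, "m2": 97, "m3": 49},
        {"m0": 49, "m1": 52, "m2": 96, "m3": 50},
    ]
    print(detect_faulty(windows, threshold=20, min_consecutive=3))
```

In this sketch the similarity step only compares machines against each other within a window, so it needs no per-metric baseline, and the continuity step trades a few windows of detection delay for fewer false alarms from transient jitter.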