Core Concepts
Large-scale distributed model training is susceptible to frequent machine failures, leading to significant downtime and economic losses. Minder, an automated faulty machine detection system, leverages machine-level similarity and continuity patterns in monitoring metrics to quickly and accurately identify faulty machines, minimizing manual effort and downtime.
Deng, Y., Shi, X., Jiang, Z., Zhang, X., Zhang, L., Li, B., Song, Z., Zhu, H., Liu, G., Li, F., Wang, S., Lin, H., Ye, J., & Yu, M. (2024). Minder: Faulty Machine Detection for Large-scale Distributed Model Training. arXiv preprint arXiv:2411.01791v1.
This paper introduces Minder, a system that automatically detects faulty machines in large-scale distributed model training, where frequent hardware and software failures cause significant downtime and economic losses.
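To make the similarity and continuity idea concrete, here is a minimal sketch, not the paper's implementation. It assumes that machines in the same training task report similar values for a monitoring metric, and that a real fault keeps one machine dissimilar for several consecutive time windows. The function names, the ratio-based score, and the default thresholds are all hypothetical choices made for illustration.

```python
from math import sqrt
from statistics import median


def window_outlier(window):
    """Return (machine_index, score) for the machine least similar to its peers.

    `window` holds one list of metric samples per machine for a single time
    window. The score is the outlier's summed distance to all other machines
    divided by the median summed distance (an assumed, illustrative rule).
    """
    n = len(window)
    totals = []
    for i in range(n):
        total = 0.0
        for j in range(n):
            if i != j:
                # Euclidean distance between the two machines' samples.
                total += sqrt(sum((a - b) ** 2 for a, b in zip(window[i], window[j])))
        totals.append(total)
    worst = max(range(n), key=lambda i: totals[i])
    med = median(totals)
    if med == 0:
        # At least half of the machines are identical to all others.
        return worst, (0.0 if totals[worst] == 0 else float("inf"))
    return worst, totals[worst] / med


def detect_faulty(series, window_size=4, score_threshold=2.0, min_consecutive=3):
    """Return the index of a persistently dissimilar machine, or None.

    `series` holds one equally long list of samples per machine. A machine is
    reported only if it is the outlier in `min_consecutive` windows in a row
    (the continuity check); a one-off spike resets the streak.
    """
    length = len(series[0])
    streak_machine, streak = None, 0
    for start in range(0, length - window_size + 1, window_size):
        window = [s[start:start + window_size] for s in series]
        machine, score = window_outlier(window)
        if score >= score_threshold:
            if machine == streak_machine:
                streak += 1
            else:
                streak_machine, streak = machine, 1
            if streak >= min_consecutive:
                return machine
        else:
            streak_machine, streak = None, 0
    return None
```

For example, with five machines reporting twelve samples each, where four stay near 50 and one stays near 95, `detect_faulty` returns the index of the deviating machine after three windows. If the deviation lasts only one window, it returns `None`. Requiring a streak trades a little detection latency for fewer false alarms from transient jitter.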