Core Concepts
Identifying and mitigating the impact of slow-performing nodes in a large supercomputer cluster using machine learning, proxy applications, and scheduling prioritization.
Abstract
The content describes a process to efficiently identify and mitigate the impact of slow-performing nodes in a large supercomputer cluster. Key highlights:
The author developed a set of proxy applications (both MPI and OpenMP-based) that can quickly assess node performance, as an alternative to the time-consuming High-Performance Linpack (HPL) benchmark.
The performance data from the proxy applications was analyzed using techniques like linear regression, Mahalanobis distance, and machine learning (Random Forests, neural networks) to identify outlier nodes that are significantly underperforming.
The author found that CPU speed showed the largest performance variation across nodes, indicating that efforts should focus on reducing CPU performance differences rather than on memory or interconnect differences.
Strategies for mitigating the impact of slow nodes were discussed, including trimming, replacement, prioritized scheduling, and customized node preference lists.
The process was able to identify 12 out of 33 slow nodes using simple regression, at least 20 out of 33 using Mahalanobis distance, and 18 out of 33 using neural network regression, demonstrating the effectiveness of the approach.
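As a sketch of the prioritized-scheduling idea summarized above (node names and benchmark scores here are hypothetical placeholders, not values from the paper), a scheduler-facing node preference list can be built by simply sorting nodes by their proxy-benchmark result:

```python
def preference_list(scores):
    """Order node names fastest-first, so a scheduler can steer
    performance-critical jobs toward faster nodes.

    `scores` maps node name -> proxy-benchmark score
    (higher = faster); names and values are illustrative.
    """
    return [name for name, _ in sorted(scores.items(),
                                       key=lambda kv: kv[1],
                                       reverse=True)]

# hypothetical proxy-benchmark results
scores = {"nid00012": 812.4, "nid00042": 640.0, "nid00007": 798.1}
print(preference_list(scores))  # fastest node first, slowest last
```

The same sorted list read back-to-front also identifies the candidates for the trimming or replacement strategies mentioned above.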
Stats
The following sentences contain key metrics or figures:
"The sheer number of nodes continues to increase in today's supercomputers, the first half of Trinity alone contains more than 9400 compute nodes."
"Factory tests allowed Los Alamos extensive time to test many different applications on a two thousand node subset of Trinity."
"There were a total of 9327 nodes for which results were obtained."
"This serves to identify 12 nodes that are performing at least 3.5 standard deviations below the mean performance."
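The "3.5 standard deviations below the mean" cutoff quoted above amounts to a one-sided z-score filter. A minimal sketch, using synthetic performance numbers rather than the paper's data:

```python
import numpy as np

def flag_slow_nodes(perf, k=3.5):
    """Return indices of nodes whose benchmark result falls at
    least `k` standard deviations below the mean (k=3.5 matches
    the cutoff quoted above)."""
    mean, std = perf.mean(), perf.std()
    return np.where(perf < mean - k * std)[0]

# synthetic scores for illustration: most nodes near 100,
# with two deliberately slow nodes at indices 3 and 42
rng = np.random.default_rng(0)
perf = rng.normal(100.0, 2.0, size=1000)
perf[[3, 42]] = 80.0
slow = flag_slow_nodes(perf)
```

Because the mean and standard deviation are estimated from data that include the slow nodes, a few extreme outliers can inflate the spread; the multivariate Mahalanobis approach described above is one way to catch underperformers this simple filter misses.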
Quotes
"Since the speed of today's clusters [is] limited by the slowest nodes, it [is] more important than ever to identify slow nodes, improve their performance if it can be done, and assure minimal usage of slower nodes during performance critical runs."
"Consequently, time considerations dictate that quickly running applications serve as a proxy for recognized performance standards such as the HPL."
"Performance outliers can also be found by computing the Mahalanobis Distance utilizing multivariate data."
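A minimal sketch of the Mahalanobis-distance approach from the quote above, assuming per-node results for several proxy metrics are stacked into a matrix; the synthetic data and the 3.5 cutoff are illustrative, not the paper's:

```python
import numpy as np

def mahalanobis_outliers(X, cutoff=3.5):
    """Flag rows of X (one row per node, one column per proxy
    metric) whose Mahalanobis distance from the sample mean
    exceeds `cutoff`."""
    diff = X - X.mean(axis=0)
    inv_cov = np.linalg.inv(np.cov(X, rowvar=False))
    # squared Mahalanobis distance per row: diff_i @ inv_cov @ diff_i
    d = np.sqrt(np.einsum("ij,jk,ik->i", diff, inv_cov, diff))
    return np.where(d > cutoff)[0], d

# synthetic data: 500 nodes x 3 proxy metrics, node 7 made slower
rng = np.random.default_rng(1)
X = rng.normal(100.0, 1.0, size=(500, 3))
X[7] = 90.0
outliers, dist = mahalanobis_outliers(X)
```

Unlike a per-metric z-score, the Mahalanobis distance accounts for correlations between the metrics, so a node that is only mildly slow on each metric individually can still stand out when all metrics sag together.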