Core Concepts
A parallel symmetric eigensolver with communication-avoiding and communication-reducing algorithms is proposed to efficiently process very small matrices in massively parallel environments.
Abstract
The paper presents a parallel symmetric eigensolver, called ABCLib_DRSSED, designed for processing very small matrices in massively parallel computing environments. The key highlights are:
The target matrix sizes are limited to fit the cache sizes per node in a supercomputer, typically a dimension of around 1,000 per node. This is motivated by the O(n^3) computational complexity of dense solvers, which becomes unrealistic for very large matrices at exascale.
Several communication-avoiding and communication-reducing algorithms are introduced based on MPI non-blocking implementations to minimize communication time. These include:
A communication-avoiding algorithm for the Householder tridiagonalization (TRD) step by reusing redundant pivot vectors.
A communication-reducing algorithm for the Householder inverse transformation (HIT) step by using blocking MPI_Bcast.
Thread parallelization of the MRRR algorithm in the symmetric eigenproblem (SEPT) step using the MEMS method.
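The TRD step above is standard Householder tridiagonalization: a sequence of symmetric similarity transforms that reduces the matrix to tridiagonal form before the tridiagonal eigenproblem is solved. As a minimal serial sketch only (pure Python, no MPI; the function name is hypothetical, and the paper's communication-avoiding reuse of pivot vectors is not modeled here):

```python
import math

def householder_tridiagonalize(a):
    """Reduce a symmetric matrix (list of lists) to tridiagonal form
    via Householder similarity transforms A <- H A H, H = I - 2 v v^T."""
    n = len(a)
    a = [row[:] for row in a]          # work on a copy
    for k in range(n - 2):
        # Householder vector v annihilating the entries below a[k+1][k]
        x = [a[i][k] for i in range(k + 1, n)]
        norm_x = math.sqrt(sum(t * t for t in x))
        if norm_x == 0.0:
            continue
        alpha = -norm_x if x[0] >= 0 else norm_x   # sign avoids cancellation
        v = x[:]
        v[0] -= alpha
        norm_v = math.sqrt(sum(t * t for t in v))
        if norm_v == 0.0:
            continue
        v = [t / norm_v for t in v]
        m = n - k - 1
        # apply H from the left to rows k+1..n-1
        for j in range(n):
            s = sum(v[i] * a[k + 1 + i][j] for i in range(m))
            for i in range(m):
                a[k + 1 + i][j] -= 2.0 * v[i] * s
        # apply H from the right to columns k+1..n-1
        for i in range(n):
            s = sum(a[i][k + 1 + j] * v[j] for j in range(m))
            for j in range(m):
                a[i][k + 1 + j] -= 2.0 * v[j] * s
    return a
```

In the parallel setting targeted by the paper, each left/right application involves a matrix-vector product whose partial sums must be combined across nodes, which is where the non-blocking MPI communication-avoiding scheme pays off.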
The performance is evaluated on the Fujitsu FX10 supercomputer with up to 4,800 nodes (76,800 cores). Key findings:
The MPI non-blocking implementation is 3x more efficient than the baseline implementation in TRD.
The hybrid MPI execution (MPI processes combined with thread parallelism) is 1.9x faster than the pure MPI execution.
The proposed solver is 2.3x faster than the ScaLAPACK routine with an optimized blocking size, and 22x faster than the ScaLAPACK routine with a cyclic-cyclic distribution.
The proposed solver is highly effective for matrix sizes that fit the L2 cache per node, achieving only a 3.97x increase in execution time when doubling the matrix dimension up to 83,138. However, the performance degrades for larger matrix sizes that exceed the L2 cache capacity.
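A back-of-the-envelope check puts the reported scaling figure in context. Assuming the dense solver performs O(n^3) floating-point work, doubling the dimension multiplies the flop count by 8, so the reported 3.97x time increase (close to 2^2 = 4, i.e. roughly quadratic growth) is well below what the flop count alone would predict in this cache-resident regime:

```python
# Illustrative arithmetic only; the 3.97x figure is the paper's measurement.
flop_ratio = 2 ** 3        # expected time ratio if O(n^3) flops dominated: 8x
observed_ratio = 3.97      # reported increase when doubling n (up to n = 83,138)
quadratic_ratio = 2 ** 2   # 4x -- what roughly O(n^2) time growth would give
```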
Stats
The matrix size per node is approximately 980x980 in the case of the K-computer.
The proposed solver is 2.3x faster than the ScaLAPACK routine with an optimized blocking size, and 22x faster than the ScaLAPACK routine with a cyclic-cyclic distribution.
Quotes
"The target matrix sizes are limited to fit the cache sizes per node in a supercomputer, typically around 1,000 per node."
"The proposed solver is highly effective for matrix sizes that fit the L2 cache per node, achieving only a 3.97x increase in execution time when doubling the matrix dimension up to 83,138."