Core Concepts
The article re-examines the COnfLUX 2.5D LU factorization algorithm proposed by Kwasniewski et al., identifying potential issues in the upper bound analysis, experimental methods, and lower bound derivation.
Abstract
The article conducts a technical re-examination of the COnfLUX algorithm and its associated analyses presented in the original paper by Kwasniewski et al. The key points are:
Upper Bound Analysis:
The 1D decomposition used for panel factorization and TRSM in the A10 and A01 regions may not engage the communication capabilities of all processors, so the original analysis may underestimate the communication bandwidth cost.
The original paper's calculation of the bandwidth cost distributes the cost across all p processors, rather than across the p_1^(1/2) * c processors that are actively involved.
The corrected bandwidth cost for the A10 and A01 regions is Ω(n^2/p^(1/2)) or Ω(n^2/p^(1/3)), which is asymptotically greater than the cost of the remaining algorithmic steps.
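The accounting issue in the points above can be illustrated with a minimal sketch. It assumes the p processors form a p_1^(1/2) x p_1^(1/2) x c grid with p = p_1 * c (an assumption not stated in this summary), and it uses an arbitrary placeholder communication volume rather than any figure from the paper. The function names are hypothetical and exist only for this illustration.

```python
import math


def active_processors(p: int, c: int) -> float:
    """Processors taking part in the step: p_1^(1/2) * c.

    Assumption (not from the summary): p = p_1 * c.
    """
    p1 = p / c
    return math.sqrt(p1) * c


def per_processor_costs(volume: float, p: int, c: int) -> tuple[float, float]:
    """Return (cost if spread over all p, cost if spread over active processors).

    `volume` is a placeholder total communication volume, not a value
    taken from the original analysis.
    """
    over_all = volume / p
    over_active = volume / active_processors(p, c)
    return over_all, over_active


def underestimation_factor(p: int, c: int) -> float:
    """Ratio of the two per-processor costs; algebraically p_1^(1/2)."""
    return p / active_processors(p, c)


if __name__ == "__main__":
    p, c = 4096, 16            # illustrative grid: p_1 = 256, p_1^(1/2) = 16
    volume = float(2 ** 20)    # placeholder volume
    over_all, over_active = per_processor_costs(volume, p, c)
    print(f"spread over all p:       {over_all}")
    print(f"spread over active only: {over_active}")
    print(f"underestimation factor:  {underestimation_factor(p, c)}")
```

Under these assumptions, dividing by p instead of by the active processor count lowers the per-processor figure by a factor of p_1^(1/2), which is the kind of gap the re-examination points to. The sketch says nothing about the actual volume moved in the A10 and A01 regions.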
Empirical Study Concerns:
The original code base only tested certain processor grid configurations and did not evaluate the communication-optimal configurations stated in the paper, potentially affecting the validity of the claims regarding the algorithm's communication optimality.
Lower Bound Derivation:
The lower bound derivation may be oversimplified: it does not account for the fact that, in parallel computation, the total number of I/O operations typically grows in proportion to the number of processors and is therefore usually asymptotically larger than in the sequential case.
The article aims to enhance the understanding and development of parallel matrix factorization algorithms by addressing these potential issues in the original work.
Stats
The communication bandwidth cost for the reduction in the A10 and A01 regions is at least Ω(n^2/p^(1/2)) or Ω(n^2/p^(1/3)), which is asymptotically greater than the cost of the remaining algorithmic steps.