Core Concepts

The paper presents an efficient implementation of the McMurchie-Davidson algorithm for evaluating 3-center and 4-center 2-particle integrals over Gaussian atomic orbitals (AOs) on modern GPUs, with support for AOs of low and high angular momenta and varying degrees of contraction.

Abstract

The authors report an implementation of the McMurchie-Davidson (MD) algorithm for evaluating 3-center and 4-center 2-particle integrals over Gaussian AOs on graphics processing units (GPUs). The work builds on their previous implementation of a matrix form of the MD algorithm for efficient GPU evaluation of 4-center 2-particle integrals over high-angular-momentum Gaussian AOs.
The authors describe three variants of the MD algorithm, each targeting a different class of integrals:
V0: for low-angular-momentum integrals, keeps all but the final results in registers.
V1: for medium-angular-momentum integrals, fuses steps to avoid main memory access.
V2: for high-angular-momentum integrals, uses the matrix form of the MD scheme.
The performance of the new implementation is assessed through microbenchmarks, showing sustained performance between 25% and 70% of the theoretical hardware peak for integrals over AOs with angular momentum up to l=6. A preliminary implementation of the Hartree-Fock exchange operator is also presented, demonstrating the ability to perform large-scale computations with up to quadruple-zeta basis sets and more than 20,000 AOs on a single 16GB V100 GPU.
The authors' C++ code is part of the experimental open-source LibintX library, available at github.com:ValeevGroup/LibintX.
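
One way to see why the matrix form suits high angular momenta is to count the functions involved. In the MD scheme, each AO pair is expanded in Hermite Gaussians of total order up to the sum of the two angular momenta, so a 4-center integral class can be viewed roughly as a Hermite integral matrix placed between two Hermite-to-AO transformation matrices. The Python sketch below computes those dimensions from the standard counting formulas. The function names are hypothetical and are not part of LibintX, and the use of pure (solid-harmonic) AOs is an assumption made for illustration.

```python
def n_cartesian(l):
    """Number of Cartesian Gaussians in a shell of angular momentum l."""
    return (l + 1) * (l + 2) // 2


def n_pure(l):
    """Number of pure (solid-harmonic) Gaussians in a shell of angular momentum l."""
    return 2 * l + 1


def n_hermite(total_l):
    """Number of Hermite Gaussians with t + u + v <= total_l."""
    return (total_l + 1) * (total_l + 2) * (total_l + 3) // 6


def md_matrix_shapes(la, lb, lc, ld):
    """Dimensions of the Hermite and AO index spaces for the class [la lb | lc ld].

    Hypothetical helper for illustration only; assumes pure AOs.
    """
    return {
        "hermite_bra": n_hermite(la + lb),
        "hermite_ket": n_hermite(lc + ld),
        "ao_bra": n_pure(la) * n_pure(lb),
        "ao_ket": n_pure(lc) * n_pure(ld),
    }


if __name__ == "__main__":
    for l in (0, 3, 6):
        print(l, md_matrix_shapes(l, l, l, l))
```

Under these assumptions the [66|66] class has a 455 × 455 Hermite block and 169 × 169 AO integrals, compared with 84 × 84 and 49 × 49 for [33|33], and a single element for [00|00]. This is consistent with the idea that matrix operations only become worthwhile once the blocks are large.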

Stats

The GPU-based implementation can achieve 25-70% of the theoretical hardware peak performance for integrals over AOs with angular momentum up to l=6.
For the [66|66] integral, the matrix formulation can perform well, approaching 50% of the hardware peak.
For medium-angular-momentum integrals like [33|33], the arithmetic intensity drops below the memory-bound threshold, necessitating the V1 variant.
For low-angular-momentum integrals, the V0 variant is used to keep data resident in registers and avoid main memory access.
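
The memory-bound threshold mentioned above can be made concrete with a simple roofline estimate: a kernel is memory-bound when its arithmetic intensity (FLOPs per byte moved to or from main memory) is below the ratio of peak compute to peak bandwidth. The Python sketch below uses nominal V100 (SXM2) figures of 7.8 TFLOP/s in double precision and 900 GB/s of memory bandwidth. These hardware numbers and the sample intensities are assumptions chosen for illustration and are not taken from the paper.

```python
# Nominal V100 (SXM2) figures; assumed for illustration, not from the paper.
V100_PEAK_FP64 = 7.8e12   # FLOP/s
V100_BANDWIDTH = 900e9    # bytes/s


def machine_balance(peak_flops=V100_PEAK_FP64, bandwidth=V100_BANDWIDTH):
    """Arithmetic intensity (FLOP/byte) at which a kernel stops being memory-bound."""
    return peak_flops / bandwidth


def attainable_flops(intensity, peak_flops=V100_PEAK_FP64, bandwidth=V100_BANDWIDTH):
    """Roofline model: performance is capped by compute or by memory traffic."""
    return min(peak_flops, intensity * bandwidth)


def is_memory_bound(intensity, peak_flops=V100_PEAK_FP64, bandwidth=V100_BANDWIDTH):
    """True if the kernel cannot reach peak compute at this arithmetic intensity."""
    return intensity < machine_balance(peak_flops, bandwidth)


if __name__ == "__main__":
    print(f"machine balance: {machine_balance():.2f} FLOP/byte")
    for intensity in (1.0, 4.0, 20.0):  # hypothetical kernel intensities
        fraction = attainable_flops(intensity) / V100_PEAK_FP64
        print(f"intensity {intensity:5.1f}: at most {fraction:.0%} of peak, "
              f"memory-bound={is_memory_bound(intensity)}")
```

In this model, keeping intermediates in registers (V0) or fusing steps (V1) reduces the bytes moved per integral, which raises the effective arithmetic intensity without changing the FLOP count.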

Quotes

"To address the drastic drop-off in performance with the angular momentum some have advocated [11] to use hybrid evaluation where the CPU engine is only used for integrals over high-l AOs, with the GPU engine only used for low l."
"Furthermore, robust density fitting basis sets are not available for wide swaths of the Periodic Table and automatic generation of density fitting basis sets can produce fitting bases with very high angular momenta (l ≫ 6). Thus it is currently not possible to completely eliminate the need for 4-center integral evaluation."

Key Insights Distilled From

by Andrey Asadc... at **arxiv.org** 05-06-2024

Deeper Inquiries

The proposed MD variants could in principle be extended to basis functions other than Gaussian atomic orbitals (AOs), but the matrix formulation would need substantial changes. The MD scheme relies on the Gaussian product theorem, which lets a product of two Gaussians on different centers be written as a finite expansion in Hermite Gaussians on a single center. Slater-type orbitals have a different functional form and no such finite expansion, so the expansion coefficients and the Hermite-to-AO transformation would have to be replaced rather than merely adjusted. The Boys function, which arises specifically from Coulomb integrals over Gaussians, would likewise need to be replaced by quantities suited to the Slater-type basis.
Numerical atomic orbitals (NAOs) are typically tabulated on radial grids rather than given in closed form, so the analytic MD recurrences do not apply to them directly. Extending the approach would mean incorporating numerical integration: the integral evaluation kernels would have to work with the grid points and weights associated with the NAOs, which changes the structure of the computation considerably.
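
For reference, the Boys function mentioned above is F_m(x) = ∫₀¹ t^(2m) exp(-x t²) dt, and an MD-type evaluation of a Coulomb integral needs it for all orders from 0 up to the total angular momentum. The Python sketch below shows one common textbook way to evaluate it: a convergent series plus downward recursion for small x, and the closed form of F_0 plus upward recursion for large x. The switching point of 35 and the function name are illustrative assumptions, and this is not claimed to be the scheme used in LibintX.

```python
import math


def boys(m_max, x):
    """Return [F_0(x), ..., F_m_max(x)] for x >= 0 (sketch for moderate m_max)."""
    if x < 0.0:
        raise ValueError("x must be non-negative")
    values = [0.0] * (m_max + 1)
    exp_mx = math.exp(-x)
    if x < 35.0:  # illustrative switching point, not tuned
        # Series for the highest order:
        # F_m(x) = exp(-x) * sum_k (2x)^k / ((2m+1)(2m+3)...(2m+2k+1))
        denom = 2 * m_max + 1
        term = 1.0 / denom
        total = term
        for _ in range(1000):
            denom += 2
            term *= 2.0 * x / denom
            total += term
            if term < 1e-17 * total:
                break
        values[m_max] = exp_mx * total
        # Downward recursion is numerically stable for all x.
        for m in range(m_max - 1, -1, -1):
            values[m] = (2.0 * x * values[m + 1] + exp_mx) / (2 * m + 1)
    else:
        # Closed form for F_0, then upward recursion (stable when x is large).
        values[0] = 0.5 * math.sqrt(math.pi / x) * math.erf(math.sqrt(x))
        for m in range(m_max):
            values[m + 1] = ((2 * m + 1) * values[m] - exp_mx) / (2.0 * x)
    return values


if __name__ == "__main__":
    print(boys(3, 0.0))   # F_m(0) = 1 / (2m + 1)
    print(boys(3, 2.5))
```

A production GPU code would more likely use precomputed interpolation tables to avoid the data-dependent loop, but the recursions above are enough to check values.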

The matrix formulation of the MD scheme, while efficient for 2-particle Gaussian AO integrals, may not carry over directly to other multi-center integrals, such as electron-nuclear attraction integrals. (Nuclear-nuclear repulsion is a classical sum over point charges and involves no AO integrals at all.) Some potential limitations and challenges include:
Structure of Interaction Terms: Electron-nuclear attraction integrals are 1-particle integrals in which the second charge distribution is a point charge rather than an AO pair. Only the bra side needs a Hermite-to-AO transformation, so the matrix formulation would have to be restructured rather than reused as is.
Different Mathematical Expressions: The working equations differ from those of the 2-particle integrals, for example in the range of Boys function orders required. Adapting the kernels to these expressions could require significant modifications.
Numerical Stability: The accuracy of the Boys function evaluation and of the recurrences should be rechecked for each new integral type, since the ranges of exponents and distances involved can differ.
Computational Cost: Electron-nuclear integrals are typically far cheaper in total than 2-particle integrals, so each kernel has less work over which to amortize its overhead. Such integrals are more likely to fall in the memory-bound regime targeted by the V0 and V1 variants than in the regime where the matrix form excels.

The techniques used in the V0 and V1 variants of the MD scheme could potentially be applied to other integral evaluation algorithms to improve their performance on low- and medium-angular-momentum integrals. By optimizing memory usage, register allocation, and parallelization strategies, similar performance enhancements could be achieved for other integral evaluation algorithms.
Some ways in which the techniques used in V0 and V1 variants could be applied to other integral evaluation algorithms include:
Memory Management: Efficient memory management strategies, such as minimizing data transfers between different levels of memory hierarchy, can improve performance for algorithms dealing with low- and medium-angular-momentum integrals.
Parallelization: Utilizing parallel computing techniques, such as thread-level parallelism and data parallelism, can enhance the performance of integral evaluation algorithms on low- and medium-angular-momentum integrals.
Optimized Data Structures: Implementing optimized data structures and algorithms tailored to the specific characteristics of low- and medium-angular-momentum integrals can lead to improved computational efficiency.
By incorporating these techniques, integral evaluation algorithms beyond the MD scheme could benefit from enhanced performance on a wide range of integral types and angular momenta.
