Core Concepts
An efficient implementation of the McMurchie-Davidson algorithm for evaluating 3-center and 4-center 2-particle integrals over Gaussian atomic orbitals (AOs) on modern GPUs, with support for AOs of low and high angular momenta and varying degrees of contraction.
Abstract
The authors report an implementation of the McMurchie-Davidson (MD) algorithm for evaluating 3-center and 4-center 2-particle integrals over Gaussian AOs on graphics processing units (GPUs). The work builds upon their previous implementation of a matrix form of the MD algorithm for efficient GPU evaluation of 4-center 2-particle integrals over high-angular-momentum Gaussian AOs.
The authors describe three variants of the MD algorithm, each targeting different classes of integrals:
- V0 variant for low-angular-momentum integrals, keeping all but the final results in registers
- V1 variant for medium-angular-momentum integrals, fusing steps to avoid main memory access
- V2 variant for high-angular-momentum integrals, using the matrix form of the MD scheme
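Why angular momentum drives the choice of variant can be seen from the standard MD bookkeeping: a product of two Gaussian shells with angular momenta la and lb is expanded in Hermite Gaussians of total order up to la + lb, so the amount of intermediate data grows quickly with l. The C++ sketch below only counts these functions. It is an illustration under textbook formulas, not LibintX code, and the names `n_cartesian`, `n_hermite`, `PairDims`, and `pair_dims` are hypothetical.

```cpp
#include <cassert>
#include <cstddef>

// Number of Cartesian Gaussians in a shell of angular momentum l:
// (l+1)(l+2)/2, e.g. 1 for s, 3 for p, 6 for d.
constexpr std::size_t n_cartesian(int l) {
    return static_cast<std::size_t>((l + 1) * (l + 2) / 2);
}

// Number of Hermite Gaussians with total order t+u+v <= L:
// (L+1)(L+2)(L+3)/6.
constexpr std::size_t n_hermite(int L) {
    return static_cast<std::size_t>((L + 1) * (L + 2) * (L + 3) / 6);
}

// Sizes relevant to one shell pair (hypothetical helper type):
// n_ab = number of Cartesian AO products in the pair,
// n_p  = number of Hermite Gaussians the products are expanded in.
struct PairDims {
    std::size_t n_ab;
    std::size_t n_p;
};

inline PairDims pair_dims(int la, int lb) {
    assert(la >= 0 && lb >= 0);
    return PairDims{n_cartesian(la) * n_cartesian(lb), n_hermite(la + lb)};
}
```

With these formulas, `pair_dims(6, 6)` gives 784 Cartesian products and 455 Hermite functions for one side of a [66|66] integral, whereas `pair_dims(0, 0)` gives 1 and 1. Intermediates of the first size cannot plausibly stay in registers, while those of the second can, which is consistent with the split into register-resident and matrix-based variants described above.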
The performance of the new implementation is assessed through microbenchmarks, showing sustained performance between 25% and 70% of the theoretical hardware peak for integrals over AOs with angular momentum up to l=6. A preliminary implementation of the Hartree-Fock exchange operator is also presented, demonstrating the ability to perform large-scale computations with up to quadruple-zeta basis sets and more than 20,000 AOs on a single 16GB V100 GPU.
The authors' C++ code is part of the experimental open-source LibintX library, available at github.com:ValeevGroup/LibintX.
Stats
The GPU-based implementation can achieve 25-70% of the theoretical hardware peak performance for integrals over AOs with angular momentum up to l=6.
For the [66|66] integral class, the matrix formulation (the V2 variant) performs well, approaching 50% of the hardware peak.
For medium-angular-momentum integrals like [33|33], the arithmetic intensity drops below the threshold at which a kernel becomes memory-bound, necessitating the V1 variant.
For low-angular-momentum integrals, the V0 variant is used to keep data resident in registers and avoid main memory access.
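The memory-bound threshold mentioned above can be illustrated with a simple roofline estimate: attainable throughput is the smaller of the compute peak and the product of arithmetic intensity (FLOP per byte moved) and memory bandwidth. The C++ sketch below is a generic roofline model, not the authors' performance analysis. The `Device` type, the function names, and the `v100_nominal` figures (roughly 7.8 TFLOP/s FP64 and 900 GB/s, nominal vendor numbers for one V100 configuration) are assumptions made for illustration.

```cpp
#include <algorithm>
#include <cassert>

// Hypothetical device description for a roofline estimate.
struct Device {
    double peak_flops;  // peak floating-point rate, FLOP/s
    double bandwidth;   // main-memory bandwidth, byte/s
};

// Assumed nominal figures for a V100; actual values depend on the model.
const Device v100_nominal{7.8e12, 900.0e9};

// Arithmetic intensity (FLOP/byte) at which the two roofs meet.
inline double ridge_point(const Device& d) {
    return d.peak_flops / d.bandwidth;
}

// Roofline bound: min(compute roof, bandwidth roof at this intensity).
inline double attainable_flops(const Device& d, double intensity) {
    assert(intensity >= 0.0);
    return std::min(d.peak_flops, intensity * d.bandwidth);
}

// A kernel is memory-bound when its intensity lies below the ridge point.
inline bool is_memory_bound(const Device& d, double intensity) {
    return intensity < ridge_point(d);
}
```

Under the assumed figures the ridge point is about 8.7 FLOP per byte. A kernel below that intensity is limited by memory traffic rather than arithmetic, which is the situation that fusing steps (V1) or keeping data in registers (V0) is meant to avoid.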
Quotes
"To address the drastic drop-off in performance with the angular momentum some have advocated [11] to use hybrid evaluation where the CPU engine is only used for integrals over high-l AOs, with the GPU engine only used for low l."
"Furthermore, robust density fitting basis sets are not available for wide swaths of the Periodic Table and automatic generation of density fitting basis sets can produce fitting bases with very high angular momenta (l ≫6). Thus it is currently not possible to completely eliminate the need for 4-center integral evaluation."