Overlapping Collective Communication with Dependent Computation in Distributed Machine Learning using GPU-Initiated Intra-Kernel Networking
Fusing computation and collective communication within a single GPU kernel to overlap communication with dependent computation, reducing overall execution time.