Core Concepts
FastKron introduces a novel approach to Kron-Matmul on GPUs, delivering significant performance improvements by avoiding the memory-access inefficiencies of implementations built on existing linear algebra kernels.
Abstract
FastKron presents an efficient technique for Kron-Matmul on single and multiple GPUs that outperforms existing implementations. The algorithm divides each row of the input matrix into slices and multiplies every slice with columns of the Kronecker factor, optimizing memory accesses and communication volume. By fusing multiple consecutive sliced multiplications, FastKron minimizes global memory accesses and significantly improves performance.
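To make the sliced-multiply idea concrete, below is a minimal NumPy sketch of the per-factor recurrence, not FastKron's CUDA implementation: each row is divided into slices of length P, every slice is multiplied with the columns of one factor, and the result is written back in shuffled order. The function name kron_matmul and all shapes are illustrative assumptions; in FastKron the shuffled store happens inside the kernel rather than as a separate transpose pass.

```python
import numpy as np

def kron_matmul(X, factors):
    """Compute X @ kron(F1, ..., FN) one factor at a time.

    Minimal sketch of the sliced-multiply recurrence: each row of the
    current intermediate is viewed as slices of length P (the row count
    of the factor), every slice is multiplied with the factor's columns,
    and the result is stored in shuffled order.
    """
    M = X.shape[0]
    Y = X
    for F in reversed(factors):
        P, Q = F.shape
        S = Y.shape[1] // P                      # number of slices per row
        Y = Y.reshape(M, S, P) @ F               # multiply slices with columns of F
        Y = Y.transpose(0, 2, 1).reshape(M, -1)  # shuffled store of the result
    return Y

# Check against an explicit Kronecker product.
rng = np.random.default_rng(0)
factors = [rng.standard_normal((4, 4)) for _ in range(3)]
X = rng.standard_normal((8, 4 ** 3))
K = factors[0]
for F in factors[1:]:
    K = np.kron(K, F)
assert np.allclose(kron_matmul(X, factors), X @ K)
```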
The paper discusses the limitations of current Kron-Matmul algorithms, introduces FastKron as a solution to these issues, explains the implementation of FastKron's CUDA kernel with shared-memory caching and a fusion mechanism, and describes the autotuning of kernel parameters for optimal performance. It also covers distributed Kron-Matmul across multiple GPUs to reduce communication overhead.
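As a rough illustration of the fusion mechanism, the sketch below applies several consecutive sliced multiplications to one row tile at a time; the tile stands in for data cached in shared memory, so each fused group incurs a single write-back instead of one global-memory round trip per factor. The helper names, tile size, and fusion depth are assumptions for illustration, not FastKron's API.

```python
import numpy as np

def sliced_multiply(Y, F):
    """One sliced multiplication with a shuffled store, as in the sketch above."""
    M, K = Y.shape
    P, Q = F.shape
    Y = Y.reshape(M, K // P, P) @ F              # multiply slices with columns of F
    return Y.transpose(0, 2, 1).reshape(M, -1)   # shuffled store

def fused_kron_matmul(X, factors, tile_rows=16, fuse=2):
    """Apply `fuse` consecutive sliced multiplications to one row tile at a
    time; the tile stands in for a block cached in shared memory, so each
    fused group needs one write-back instead of one per factor."""
    rev = list(reversed(factors))                # factors are consumed right to left
    Y = X
    for g in range(0, len(rev), fuse):
        group = rev[g:g + fuse]
        tiles = []
        for r in range(0, Y.shape[0], tile_rows):
            tile = Y[r:r + tile_rows]            # one "load" of the tile
            for F in group:                      # fused sliced multiplies
                tile = sliced_multiply(tile, F)
            tiles.append(tile)
        Y = np.vstack(tiles)                     # one "store" per fused group
    return Y

# The fused version matches applying each sliced multiply separately.
rng = np.random.default_rng(1)
factors = [rng.standard_normal((4, 4)) for _ in range(4)]
X = rng.standard_normal((64, 4 ** 4))
unfused = X
for F in reversed(factors):
    unfused = sliced_multiply(unfused, F)
assert np.allclose(fused_kron_matmul(X, factors), unfused)
```

In FastKron the number of fusable steps is bounded by shared-memory capacity; the fuse=2 default here is an arbitrary stand-in for that limit.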
Key points include:
Introduction to Kronecker Matrix-Matrix Multiplication (Kron-Matmul)
Existing algorithms like the shuffle algorithm and FTMMT algorithm
Limitations of current implementations leading to inefficiencies
FastKron's approach to optimize Kron-Matmul on GPUs
Implementation details including shared memory caching and fusion techniques
Autotuning parameters for efficient computation (see the sketch after this list)
Distributed Kron-Matmul strategy across multiple GPUs
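As a hedged illustration of the autotuning step, the sketch below exhaustively times a small grid of candidate parameters and keeps the fastest configuration. The autotune helper and the tuning space (row-tile size and fusion depth) are assumptions, not FastKron's actual parameters, and the example reuses fused_kron_matmul from the fusion sketch above in place of a real kernel launcher.

```python
import itertools
import time

import numpy as np

def autotune(run, param_grid, reps=3):
    """Time every combination in the grid and return the fastest.

    `run(**cfg)` stands in for executing one Kron-Matmul configuration;
    the parameter names are illustrative only.
    """
    best_cfg, best_time = None, float("inf")
    for values in itertools.product(*param_grid.values()):
        cfg = dict(zip(param_grid, values))
        start = time.perf_counter()
        for _ in range(reps):
            run(**cfg)
        elapsed = (time.perf_counter() - start) / reps
        if elapsed < best_time:
            best_cfg, best_time = cfg, elapsed
    return best_cfg, best_time

# Tune the fused sketch over row-tile size and fusion depth
# (reuses fused_kron_matmul from the fusion sketch above).
rng = np.random.default_rng(2)
factors = [rng.standard_normal((8, 8)) for _ in range(4)]
X = rng.standard_normal((512, 8 ** 4))
cfg, t = autotune(
    lambda tile_rows, fuse: fused_kron_matmul(X, factors, tile_rows, fuse),
    {"tile_rows": [32, 64, 128], "fuse": [1, 2, 4]},
)
print(cfg, t)
```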
Statistics
FastKron performs up to 40.7× faster than existing implementations on 1 GPU.
On a system with 16 NVIDIA Tesla V100 GPUs, FastKron is 7.85× faster than Cyclops Tensor Framework (CTF).
FastKron reduces training time of Gaussian Process techniques by up to 6.20×.
Quotes
"FastKron provides significant performance speedup over state-of-the-art single and multi-GPU Kron-Matmul implementations."
"Existing linear algebra kernels used in Kron-Matmul miss several optimizations that FastKron addresses efficiently."