The paper explores fusing depthwise (DW) and pointwise (PW) convolutions to overcome the memory access bottleneck on GPUs. It proposes Fused Convolutional Modules (FCMs), a set of novel fused GPU kernels that combine DW and PW convolutions. FCMs reduce the global memory accesses of these convolutions, improving execution time and energy efficiency.
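To make the fusion idea concrete, here is a minimal pure-Python sketch, not the paper's GPU kernels. It assumes a 3x3 depthwise filter, stride 1, and no padding; the function names `depthwise3x3`, `pointwise`, and `fused_dw_pw` are hypothetical. The unfused path materializes the whole DW output before PW reads it back, while the fused path keeps each pixel's DW results in a small local buffer (a stand-in for registers or shared memory) and consumes them immediately.

```python
def depthwise3x3(x, w_dw):
    """Depthwise conv. x: [C][H][W], w_dw: [C][3][3]; stride 1, no padding."""
    C, H, W = len(x), len(x[0]), len(x[0][0])
    return [[[sum(x[c][i + di][j + dj] * w_dw[c][di][dj]
                  for di in range(3) for dj in range(3))
              for j in range(W - 2)]
             for i in range(H - 2)]
            for c in range(C)]


def pointwise(x, w_pw):
    """Pointwise (1x1) conv. x: [C][H][W], w_pw: [K][C]."""
    C, H, W = len(x), len(x[0]), len(x[0][0])
    return [[[sum(w_pw[k][c] * x[c][i][j] for c in range(C))
              for j in range(W)]
             for i in range(H)]
            for k in range(len(w_pw))]


def fused_dw_pw(x, w_dw, w_pw):
    """DW followed by PW without materializing the full DW output."""
    C, H, W = len(x), len(x[0]), len(x[0][0])
    K = len(w_pw)
    out = [[[0] * (W - 2) for _ in range(H - 2)] for _ in range(K)]
    for i in range(H - 2):
        for j in range(W - 2):
            # Per-pixel DW results live only in this local buffer.
            dw = [sum(x[c][i + di][j + dj] * w_dw[c][di][dj]
                      for di in range(3) for dj in range(3))
                  for c in range(C)]
            for k in range(K):
                out[k][i][j] = sum(w_pw[k][c] * dw[c] for c in range(C))
    return out


# Small integer example: 2 input channels, 4x4 input, 3 output channels.
x = [[[c + i * 4 + j for j in range(4)] for i in range(4)] for c in range(2)]
w_dw = [[[1, 0, -1], [2, 0, -2], [1, 0, -1]],
        [[1, 1, 1], [1, 1, 1], [1, 1, 1]]]
w_pw = [[1, 2], [0, 1], [-1, 3]]

unfused_out = pointwise(depthwise3x3(x, w_dw), w_pw)
fused_out = fused_dw_pw(x, w_dw, w_pw)
```

Both paths compute the same result; the only difference is where the intermediate DW values are stored, which is the source of the global memory savings the summary describes.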
The paper also introduces FusePlanner, a set of cost models that estimate the global memory accesses of DW, PW, and FCM kernels given the characteristics of the target GPU. FusePlanner decides which layers benefit from fusion and selects the FCM parameters that minimize global memory accesses.
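The kind of decision such a planner makes can be sketched with a deliberately idealized access count. This is not FusePlanner's actual cost model: it assumes stride 1, "same" padding, and that every tensor element is read or written exactly once. The `halo_factor` parameter is a hypothetical knob standing in for input re-reads caused by tiling; all function names are illustrative.

```python
def dw_accesses(C, H, W, k=3):
    """Elements moved by a standalone DW kernel: input + weights + output."""
    return C * H * W + C * k * k + C * H * W


def pw_accesses(C, K, H, W):
    """Elements moved by a standalone PW kernel: input + weights + output."""
    return C * H * W + K * C + K * H * W


def unfused_accesses(C, K, H, W, k=3):
    """DW writes its output to global memory and PW reads it back."""
    return dw_accesses(C, H, W, k) + pw_accesses(C, K, H, W)


def fused_accesses(C, K, H, W, k=3, halo_factor=1.0):
    """Fused kernel: the intermediate never touches global memory.

    halo_factor >= 1 is a hypothetical multiplier for input re-reads
    that tiling may introduce; 1.0 means no redundant reads.
    """
    return halo_factor * C * H * W + C * k * k + K * C + K * H * W


def should_fuse(C, K, H, W, k=3, halo_factor=1.0):
    """Fuse only if the estimated fused traffic is lower."""
    return fused_accesses(C, K, H, W, k, halo_factor) < unfused_accesses(C, K, H, W, k)


# Example layer: 32 -> 64 channels on a 56x56 feature map.
baseline = unfused_accesses(32, 64, 56, 56)
ideal_fused = fused_accesses(32, 64, 56, 56)
```

With no redundant reads the fused estimate is always lower, since it simply drops the intermediate write and read. Once re-reads grow large enough (here, a high `halo_factor`), fusion stops paying off, which is why a per-layer decision is needed at all.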
The evaluation on three GPUs, using representative CNNs and ViTs, shows that FCMs achieve up to 3.7x speedup over cuDNN while saving up to 83% of its global memory accesses. End-to-end implementations of four CNNs using the proposed kernels achieve up to 1.8x speedup over TVM implementations and save up to two-thirds of the energy per inference.
Key insights distilled from:
by Fareed Qarar... on arxiv.org, 05-01-2024
https://arxiv.org/pdf/2404.19331.pdf