Core Concepts
Fused Convolutional Modules (FCMs) substantially alleviate the memory-access bottleneck of depthwise and pointwise convolutions, enabling low-latency, energy-efficient execution on GPUs.
Abstract
The paper explores fusing depthwise (DW) and pointwise (PW) convolutions to overcome the memory access bottleneck on GPUs. It proposes Fused Convolutional Modules (FCMs), a set of novel fused GPU kernels that combine DW and PW convolutions. FCMs reduce the global memory accesses of these convolutions, improving execution time and energy efficiency.
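To make the fusion idea concrete, here is a minimal sketch of a fused DW+PW CUDA kernel; it is not the paper's actual FCM implementation. The NCHW layout, batch size 1, the 3x3 stride-1 depthwise filter, and the F_TILE tiling factor are all illustrative assumptions. The key point is that each depthwise result is consumed by the pointwise stage while still in a register, so the intermediate tensor never round-trips through global memory.

```cuda
#define KS 3        // depthwise kernel size (assumed 3x3, stride 1, "same" padding)
#define F_TILE 4    // pointwise output features accumulated per thread

// in:    [C, H, W]   input activations (batch 1, NCHW)
// dw_w:  [C, KS, KS] depthwise weights
// pw_w:  [F, C]      pointwise weights
// out:   [F, H, W]   final output
// Launch: dim3 block(16, 16);
//         dim3 grid((W + 15) / 16, (H + 15) / 16, (F + F_TILE - 1) / F_TILE);
__global__ void fused_dw_pw(const float* __restrict__ in,
                            const float* __restrict__ dw_w,
                            const float* __restrict__ pw_w,
                            float* __restrict__ out,
                            int C, int F, int H, int W) {
    int ox = blockIdx.x * blockDim.x + threadIdx.x;  // output column
    int oy = blockIdx.y * blockDim.y + threadIdx.y;  // output row
    int f0 = blockIdx.z * F_TILE;                    // first feature of this tile
    if (ox >= W || oy >= H) return;

    float acc[F_TILE] = {0.f};                       // pointwise accumulators

    for (int c = 0; c < C; ++c) {
        // Depthwise stage: compute one output element for (c, oy, ox).
        float dw = 0.f;
        for (int ky = 0; ky < KS; ++ky)
            for (int kx = 0; kx < KS; ++kx) {
                int iy = oy + ky - KS / 2, ix = ox + kx - KS / 2;
                if (iy >= 0 && iy < H && ix >= 0 && ix < W)
                    dw += in[(c * H + iy) * W + ix]
                        * dw_w[(c * KS + ky) * KS + kx];
            }
        // Pointwise stage: consume `dw` immediately from its register for
        // every output feature this thread is responsible for.
        for (int ft = 0; ft < F_TILE; ++ft)
            if (f0 + ft < F)
                acc[ft] += dw * pw_w[(f0 + ft) * C + c];
    }

    for (int ft = 0; ft < F_TILE; ++ft)
        if (f0 + ft < F)
            out[((f0 + ft) * H + oy) * W + ox] = acc[ft];
}
```

The sketch also exposes fusion's trade-off: with more than one feature tile (gridDim.z > 1), the depthwise work and its input reads are repeated once per tile. Whether the eliminated intermediate traffic outweighs this recomputation is exactly the per-layer decision the paper delegates to FusePlanner.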
The paper also introduces FusePlanner, a set of cost models that estimate the global memory accesses of DW, PW, and FCM kernels given the characteristics of the target GPU. Using these estimates, FusePlanner decides which layers benefit from fusion and selects the FCM parameters that minimize global memory accesses.
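A first-order version of such a cost model fits in a few lines of host code. The traffic formulas below are illustrative assumptions, not the paper's models: they count global memory accesses in elements, ignore caching, and reuse the f_tiles recomputation factor from the kernel sketch above, whereas the paper's models additionally account for GPU characteristics.

```cuda
#include <cstdint>

struct Layer { int C, F, H, W, K; };  // channels, PW filters, spatial dims, DW kernel size

// Unfused baseline: the DW kernel reads the input and its weights and writes
// its output; the PW kernel then re-reads that output, reads its weights,
// and writes the final tensor.
std::int64_t unfused_traffic(const Layer& l) {
    std::int64_t dw = (std::int64_t)l.C * l.H * l.W      // input read
                    + (std::int64_t)l.C * l.K * l.K      // DW weights
                    + (std::int64_t)l.C * l.H * l.W;     // DW output write
    std::int64_t pw = (std::int64_t)l.C * l.H * l.W      // DW output read
                    + (std::int64_t)l.F * l.C            // PW weights
                    + (std::int64_t)l.F * l.H * l.W;     // final output write
    return dw + pw;
}

// Fused: the DW intermediate stays on chip, but the input and DW weights are
// re-read once per output-feature tile (f_tiles tiles in total).
std::int64_t fused_traffic(const Layer& l, int f_tiles) {
    return (std::int64_t)f_tiles * l.C * l.H * l.W       // input re-reads
         + (std::int64_t)f_tiles * l.C * l.K * l.K       // DW weights re-reads
         + (std::int64_t)l.F * l.C                       // PW weights
         + (std::int64_t)l.F * l.H * l.W;                // final output write
}

// A planner in the spirit of FusePlanner fuses a layer only when the best
// tiling beats the unfused baseline.
bool should_fuse(const Layer& l, int f_tiles) {
    return fused_traffic(l, f_tiles) < unfused_traffic(l);
}
```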
The evaluation on three GPUs using representative CNNs and ViTs shows that FCMs achieve up to a 3.7x speedup over cuDNN and save up to 83% of its global memory accesses. End-to-end implementations of four CNNs using the proposed kernels run up to 1.8x faster than TVM implementations and save up to two-thirds of the energy per inference.
Statistics
Fused Convolutional Modules (FCMs) achieve up to a 3.7x speedup over cuDNN.
FCMs save up to 83% of the global memory accesses compared to cuDNN.
End-to-end implementations of four CNNs using the proposed kernels achieve up to a 1.8x speedup compared to TVM implementations.
End-to-end implementations of four CNNs using the proposed kernels save up to two-thirds of the energy per inference compared to TVM implementations.
Quotes
"Fused Convolutional Modules (FCMs) significantly reduce pointwise and depthwise convolutions memory accesses, improving execution time and energy efficiency."
"FCMs achieve up to 3.7x speedup over cuDNN and save up to 83% of the global memory accesses compared to cuDNN."
"End-to-end implementations of four CNNs using the proposed kernels achieve up to 1.8x speedup compared to TVM implementations and save up to two-thirds of the energy per inference."