The content summarizes a paper arguing that computational efficiency, in addition to model efficiency, is essential for high-performance convolutional neural networks (convnets). It introduces the "efficiency gap": the difference between a model's ideal latency and its actual latency, caused by poor computational efficiency.
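To make the efficiency gap concrete, here is a minimal roofline-style sketch, not taken from the paper: a layer's ideal latency is bounded by its compute time or its memory-traffic time, whichever is larger, and the efficiency gap is whatever a measured latency adds on top. The peak-throughput numbers, layer figures, and measured latency below are all hypothetical.

```python
# A minimal sketch of the roofline-style "efficiency gap" arithmetic.
# All hardware numbers and measurements here are assumed, not from the paper.
PEAK_FLOPS = 100e12  # 100 TFLOP/s peak arithmetic throughput (assumed)
PEAK_BW = 1.0e12     # 1 TB/s peak memory bandwidth (assumed)

def ideal_latency(flops: float, bytes_moved: float) -> float:
    """Roofline lower bound: a layer can run no faster than its compute
    time or its memory-traffic time, whichever is larger."""
    return max(flops / PEAK_FLOPS, bytes_moved / PEAK_BW)

# Example: a layer with 2 GFLOPs of work and 40 MB of memory traffic.
flops, bytes_moved = 2e9, 40e6
ideal = ideal_latency(flops, bytes_moved)
actual = 120e-6  # measured latency in seconds (hypothetical)
print(f"ideal {ideal*1e6:.0f} us, actual {actual*1e6:.0f} us, "
      f"efficiency gap {(actual - ideal)*1e6:.0f} us")
```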
The author first analyzes the operational intensity (FLOPs per byte of memory traffic) of different convnet layers: full convolution, point-wise convolution, grouped convolution, and depth-wise convolution. The analysis shows that degenerate convolutions such as depth-wise have very low operational intensity, making them memory-bound and capping their computational efficiency far below the hardware's arithmetic peak.
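The sketch below compares the operational intensity of the four layer types; the feature-map size, channel counts, and fp16 assumption are illustrative choices, not shapes from the paper. Depth-wise convolution divides the FLOPs by the channel count while keeping nearly all of the activation traffic, which is why its intensity collapses.

```python
# A minimal sketch comparing operational intensity across conv layer types.
# Shapes and fp16 (2 bytes/element) are illustrative assumptions.
def op_intensity(h, w, c_in, c_out, k, groups, dtype_bytes=2):
    """FLOPs per byte of memory traffic for a conv layer
    (2 FLOPs per multiply-accumulate; traffic = input + output + weights)."""
    flops = 2 * h * w * k * k * c_in * c_out // groups
    traffic = dtype_bytes * (h * w * c_in                    # input activations
                             + h * w * c_out                 # output activations
                             + k * k * c_in * c_out // groups)  # weights
    return flops / traffic

h = w = 64
c = 256
print("full 3x3 conv:      ", op_intensity(h, w, c, c, 3, groups=1))
print("point-wise 1x1 conv:", op_intensity(h, w, c, c, 1, groups=1))
print("grouped 3x3 (g=32): ", op_intensity(h, w, c, c, 3, groups=32))
print("depth-wise 3x3 conv:", op_intensity(h, w, c, c, 3, groups=c))
```

With these illustrative shapes, the full 3x3 convolution lands near 900 FLOPs/byte while the depth-wise convolution lands near 4.5, far below the machine balance of typical accelerators, which is exactly the memory-bound regime the summary describes.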
To address this issue, the author proposes "block fusion" - an optimization that implements all the layers within a residual block as a single kernel. This exploits temporal locality, avoids communication, and reduces workspace size, thereby improving computational efficiency.
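The following NumPy sketch illustrates the fusion idea on a simplified two-layer residual block; the paper's actual kernels fuse full ConvFirst and MBConv blocks in CUDA, and the shapes, tile size, and the `pointwise`/`unfused`/`fused` helpers here are illustrative assumptions.

```python
import numpy as np

# A minimal sketch of block fusion, assuming a simplified residual block
# of two point-wise (1x1) convs with ReLU. Shapes are illustrative.
def pointwise(x, w):
    """1x1 conv as a matmul over channels: (H, W, Cin) @ (Cin, Cout), + ReLU."""
    return np.maximum(x @ w, 0.0)

def unfused(x, w1, w2):
    # Layer-wise execution: the full intermediate tensor is materialized
    # (in a real kernel, written to and re-read from main memory).
    t = pointwise(x, w1)
    return x + pointwise(t, w2)

def fused(x, w1, w2, tile=16):
    # Block-fused execution: each spatial tile flows through both layers
    # while it is "on chip", so the intermediate never touches main memory
    # and the residual add happens immediately.
    out = np.empty_like(x)
    for i in range(0, x.shape[0], tile):
        xt = x[i:i + tile]  # load one tile
        out[i:i + tile] = xt + pointwise(pointwise(xt, w1), w2)
    return out

rng = np.random.default_rng(0)
x = rng.standard_normal((64, 64, 32)).astype(np.float32)
w1 = rng.standard_normal((32, 128)).astype(np.float32)
w2 = rng.standard_normal((128, 32)).astype(np.float32)
assert np.allclose(unfused(x, w1, w2), fused(x, w1, w2), atol=1e-4)
```

Tiling over the spatial dimension stands in for what the CUDA kernels do with thread blocks and shared memory: each tile's intermediate activations stay on chip, so the fused version performs the same arithmetic with far less main-memory traffic.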
The author develops a "tensor machine" abstraction to express and plan these block-fusion kernels, then implements CUDA kernels for the ConvFirst and MBConv blocks and benchmarks them, showing significant improvements in computational efficiency and latency over baseline models such as EfficientNet and ConvNeXt.
The key insights are: 1) Computational efficiency, not just model efficiency, is crucial for high-performance convnets. 2) Degenerate convolution layers like depth-wise have low operational intensity, making them memory-bound. 3) Block fusion can improve computational efficiency by co-optimizing the model and kernels.