Core Concepts
This paper presents a comprehensive review of recent advances in designing efficient attention mechanisms for Vision Transformer (ViT) networks, with the goal of improving both accuracy and computational efficiency.
Abstract
The paper provides a thorough overview of the attention mechanism and its importance in computer vision tasks. It introduces a unified attention model and two taxonomies to categorize different attention mechanisms.
The main focus of the paper is on improving the efficiency of ViT networks through the design of the attention mechanism. The authors propose a novel taxonomy that groups ViT architectures by how their attention mechanism is designed:
Self-Attention Complexity Reduction: Techniques such as windowing, reordering the attention computation, and channel attention are used to reduce the computational complexity of self-attention (a windowed-attention sketch follows this list).
Hierarchical Transformer: Multi-scale feature representations are exploited to improve image understanding and reduce computational cost.
Channel and Spatial Transformer: Strategies like transposing the output tensor and incorporating channel attention are used to regain global context after patch merging and windowed self-attention.
Rethinking Tokenization: Approaches that modify the token representation, such as adding more informative tokens or reducing redundant tokens, are explored.
Other: Diverse strategies like focal modulation, convolution integration, and deformable attention are also discussed.
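As a concrete illustration of the windowing idea referenced in the list above, here is a minimal sketch (single-head attention with learned projections omitted for brevity; shapes and the function name are illustrative, not any specific model's implementation). Attending only within non-overlapping w x w windows reduces the cost from O(N^2) to roughly O(N * w^2) for N = H * W tokens.

```python
import torch

def windowed_self_attention(x, w):
    # x: (B, H, W, C) token grid; w: window size (H and W assumed divisible by w)
    B, H, W, C = x.shape
    # Partition the grid into non-overlapping w x w windows: (B * num_windows, w*w, C)
    x = x.view(B, H // w, w, W // w, w, C)
    windows = x.permute(0, 1, 3, 2, 4, 5).reshape(-1, w * w, C)
    # Plain single-head attention inside each window (query/key/value projections omitted)
    attn = torch.softmax(windows @ windows.transpose(1, 2) / C ** 0.5, dim=-1)
    out = attn @ windows
    # Undo the window partition to recover the (B, H, W, C) grid
    out = out.view(B, H // w, W // w, w, w, C).permute(0, 1, 3, 2, 4, 5)
    return out.reshape(B, H, W, C)

x = torch.randn(2, 8, 8, 32)
print(windowed_self_attention(x, w=4).shape)  # torch.Size([2, 8, 8, 32])
```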
The paper reviews these design techniques in detail, including their underlying principles, advantages, and limitations. It also discusses real-world applications of efficient ViT models in domains such as image recognition, object detection, segmentation, and medical imaging.
Stats
In the Efficient Self-Attention (ESA) mechanism, the key/value sequence length is reduced from N to N/R by the spatial reduction ratio R, while the channel dimension is expanded by a factor of R before being projected back.
The computational complexity of ESA is therefore O(N^2/R), compared to O(N^2) for unmodified self-attention.
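Below is a minimal sketch of this spatial-reduction idea. The module structure and layer names (q, kv_reduce, k, v) are illustrative assumptions rather than the exact implementation from any particular paper: the key/value sequence is reshaped from N tokens of C channels to N/R tokens of C*R channels and projected back to C channels, so the attention map has shape N x (N/R), i.e. O(N^2/R) cost.

```python
import torch
import torch.nn as nn

class EfficientSelfAttention(nn.Module):
    def __init__(self, dim, reduction_ratio):
        super().__init__()
        self.r = reduction_ratio
        self.q = nn.Linear(dim, dim)
        self.kv_reduce = nn.Linear(dim * reduction_ratio, dim)  # C*R -> C
        self.k = nn.Linear(dim, dim)
        self.v = nn.Linear(dim, dim)

    def forward(self, x):
        # x: (B, N, C); assumes N is divisible by the reduction ratio R
        B, N, C = x.shape
        q = self.q(x)                                  # (B, N, C)
        kv = x.reshape(B, N // self.r, C * self.r)     # (B, N/R, C*R)
        kv = self.kv_reduce(kv)                        # (B, N/R, C)
        k, v = self.k(kv), self.v(kv)
        attn = torch.softmax(q @ k.transpose(1, 2) / C ** 0.5, dim=-1)  # (B, N, N/R)
        return attn @ v                                # (B, N, C)

x = torch.randn(2, 64, 32)
print(EfficientSelfAttention(dim=32, reduction_ratio=4)(x).shape)  # torch.Size([2, 64, 32])
```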
Quotes
"Efficient attention normalizes the keys and queries first, then multiplies the keys and values, and lastly, the resulting global context vectors are multiplied by the queries."
"The keys and queries are transposed in cross-covariance attention, therefore the attention weights are based on the cross-covariance matrix."
"The main advantage of CrossViT is a more efficient model because the number of transformer encoders is small for the small branch patches."