Key concepts
Efficient modeling of global and local relationships in vision transformers through Hierarchical Multi-Head Self-Attention.
Summary
This paper introduces Hierarchical Multi-Head Self-Attention (H-MHSA) to tame the quadratic computational cost of standard Multi-Head Self-Attention in vision transformers. H-MHSA splits the input image into patches and computes self-attention hierarchically: first within small local patch grids, then over downsampled (grid-pooled) tokens, aggregating the resulting local and global features efficiently. Experiments demonstrate the effectiveness of H-MHSA across a range of vision tasks.
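To make the mechanism concrete, below is a minimal single-head PyTorch sketch of this kind of hierarchical attention; it is not the authors' implementation, and the class name, grid_size parameter, and average-pooling choice are illustrative assumptions.

```python
# Minimal single-head sketch of hierarchical self-attention (illustrative, not the
# authors' code): local attention inside non-overlapping patch grids, then global
# attention over grid-pooled tokens, with the two results summed.
import torch
import torch.nn as nn


class HierarchicalSelfAttention(nn.Module):
    def __init__(self, dim, grid_size=4):
        super().__init__()
        self.grid_size = grid_size          # assumed local grid of grid_size x grid_size patches
        self.qkv = nn.Linear(dim, dim * 3)  # joint query/key/value projection
        self.proj = nn.Linear(dim, dim)     # output projection

    @staticmethod
    def attend(q, k, v):
        # Standard scaled dot-product attention.
        attn = (q @ k.transpose(-2, -1)) / (q.shape[-1] ** 0.5)
        return attn.softmax(dim=-1) @ v

    def forward(self, x, H, W):
        # x: (B, N, C) patch tokens on an H x W grid (N = H * W);
        # H and W are assumed divisible by grid_size.
        B, N, C = x.shape
        g = self.grid_size
        q, k, v = self.qkv(x).chunk(3, dim=-1)

        # Step 1: local attention inside each non-overlapping g x g grid.
        def to_grids(t):
            t = t.reshape(B, H // g, g, W // g, g, C)
            return t.permute(0, 1, 3, 2, 4, 5).reshape(-1, g * g, C)

        local = self.attend(to_grids(q), to_grids(k), to_grids(v))
        local = local.reshape(B, H // g, W // g, g, g, C)
        local = local.permute(0, 1, 3, 2, 4, 5).reshape(B, N, C)

        # Step 2: global attention of every token against grid-pooled tokens,
        # so the key/value set has only (H/g) * (W/g) entries.
        pooled = x.reshape(B, H // g, g, W // g, g, C).mean(dim=(2, 4)).reshape(B, -1, C)
        _, k_g, v_g = self.qkv(pooled).chunk(3, dim=-1)
        global_ = self.attend(q, k_g, v_g)

        # Aggregate local and global features and project.
        return self.proj(local + global_)
```

A quick shape check under these assumptions:

```python
x = torch.randn(2, 32 * 32, 64)                        # 2 images, 32x32 patch tokens, 64 channels
attn = HierarchicalSelfAttention(dim=64, grid_size=4)
print(attn(x, H=32, W=32).shape)                       # torch.Size([2, 1024, 64])
```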
- Introduction to Vision Transformers and challenges in applying transformers to vision tasks.
- Proposal of Hierarchical Multi-Head Self-Attention (H-MHSA) to address computational complexity.
- Explanation of the H-MHSA mechanism for local and global relationship modeling.
- Construction of Hierarchical-Attention-based Transformer Networks (HAT-Net) incorporating H-MHSA.
- Evaluation of HAT-Net on image classification, semantic segmentation, object detection, and instance segmentation tasks.
Statistics
Because attention is computed over only a limited number of tokens at each step, the computational load is reduced dramatically (an illustrative comparison follows below).
HAT-Net-Tiny, HAT-Net-Small, HAT-Net-Medium, and HAT-Net-Large outperform the second best results by 1.1%, 0.6%, 0.8%, and 0.6% in terms of top-1 accuracy, respectively.
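To see where that reduction comes from, here is a rough back-of-the-envelope comparison of attention-matrix sizes; the token counts (a 56x56 patch grid with 7x7 local grids) are illustrative assumptions, not figures from the paper.

```python
# Illustrative arithmetic only; token counts are assumptions, not the paper's setup.
N, g = 56 * 56, 7             # 3136 patch tokens, 7x7 local grids
full_attention = N * N        # standard self-attention: every token attends to every token
local = N * g * g             # hierarchical step 1: each token attends to its own g*g grid
global_ = N * (N // (g * g))  # hierarchical step 2: each token attends to the pooled grid tokens
print(full_attention, local + global_, full_attention / (local + global_))
# ~9.83e6 vs ~0.35e6 attention entries, roughly a 28x reduction
```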
Quotes
"Since we only calculate attention for a limited number of tokens at each step, the computational load is reduced dramatically."
"HAT-Net outperforms the second best results by 1.1%, 0.6%, 0.8%, and 0.6% in terms of top-1 accuracy."