Core Concept
The Atrous Attention mechanism in ACC-ViT strengthens global context and hierarchical relations in vision transformers.
Summary
ACC-ViT introduces Atrous Attention, which combines regional and sparse attention to consolidate information more effectively. Inspired by atrous convolution, it balances local and global information. The model outperforms MaxViT on ImageNet-1K while using fewer parameters. Evaluations covering finetuning, linear probing, and zero-shot learning show ACC-ViT's versatility. The ablation study highlights the importance of shared MLP layers and adaptive gating for the performance improvement.
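The idea of attending over dilated (atrous) grids and blending the branches with a gate can be illustrated with a minimal NumPy sketch. This is not the ACC-ViT implementation: the function names, the dilation rates (1, 2, 4), the projection-free single-head attention, and the fixed gate_logits vector (standing in for a learned, input-dependent gate) are all assumptions made for illustration. Here rate 1 attends densely over the whole small map, while higher rates attend over sparse, strided sub-grids.

```python
import numpy as np

def atrous_partition(x, rate):
    """Split an (H, W, C) map into rate*rate strided sub-grids of shape (H/rate, W/rate, C)."""
    h, w, _ = x.shape
    assert h % rate == 0 and w % rate == 0, "H and W must be divisible by rate"
    return np.stack([x[i::rate, j::rate] for i in range(rate) for j in range(rate)])

def atrous_merge(parts, rate):
    """Inverse of atrous_partition: scatter the sub-grids back to (H, W, C)."""
    _, hs, ws, c = parts.shape
    out = np.empty((hs * rate, ws * rate, c), dtype=parts.dtype)
    k = 0
    for i in range(rate):
        for j in range(rate):
            out[i::rate, j::rate] = parts[k]
            k += 1
    return out

def self_attention(tokens):
    """Single-head softmax attention without learned projections (a simplification)."""
    scores = tokens @ tokens.T / np.sqrt(tokens.shape[-1])
    scores = scores - scores.max(axis=-1, keepdims=True)  # numerical stability
    weights = np.exp(scores)
    weights = weights / weights.sum(axis=-1, keepdims=True)
    return weights @ tokens

def atrous_attention(x, rates=(1, 2, 4), gate_logits=None):
    """Attend within each dilated sub-grid, then blend the branches with softmax gates."""
    c = x.shape[-1]
    branches = []
    for r in rates:
        parts = atrous_partition(x, r)
        attended = np.stack(
            [self_attention(p.reshape(-1, c)).reshape(p.shape) for p in parts]
        )
        branches.append(atrous_merge(attended, r))
    branches = np.stack(branches)  # (num_rates, H, W, C)
    if gate_logits is None:
        # Hypothetical placeholder: equal weighting instead of a learned gate.
        gate_logits = np.zeros(len(rates))
    gates = np.exp(gate_logits - np.max(gate_logits))
    gates = gates / gates.sum()
    return np.tensordot(gates, branches, axes=1)  # weighted sum over branches

x = np.random.default_rng(0).normal(size=(8, 8, 4))
y = atrous_attention(x)
print(y.shape)
```

The output keeps the input's spatial shape, so a block like this could in principle be stacked. How the actual model forms windows, shares MLP layers, and computes its gates is not specified in this summary.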
Statistics
Our tiny model variant achieves approximately 84% accuracy on ImageNet-1K.
ACC-ViT has been evaluated on tasks involving medical image analysis, object detection, and language-image contrastive learning.
The ACC-ViT nano model has 16.77% fewer parameters than tiny-MOAT-3 but performs similarly.
Quotes
"Atrous Attention in ACC-ViT adaptively consolidates both local and global information."
"ACC-ViT outperforms state-of-the-art models like MaxViT while having fewer parameters."