The article introduces TransNeXt, a new visual backbone architecture that aims to address the limitations of existing vision transformer models. The key highlights are:
Pixel-focused attention: The authors propose a novel token mixer called pixel-focused attention, which simulates the continuous movement of the human eyeball and aligns with the focal perception mode of biological vision. This mechanism operates on a per-pixel basis, providing fine-grained attention to nearby features and coarse-grained attention to global features.
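The fine/coarse routing can be made concrete with a small sketch. The snippet below is a minimal, single-head illustration in the spirit of pixel-focused attention, not the paper's implementation: each pixel's query attends under one joint softmax to keys from a sliding local window and to keys from a pooled global summary. The window size, pooling size, and single-head layout are illustrative assumptions.

```python
# Minimal sketch of dual-path (local + pooled-global) attention per pixel.
# Shapes, window/pool sizes, and the single-head layout are assumptions.
import torch
import torch.nn.functional as F

def pixel_focused_attention_sketch(x, window=3, pool_size=7):
    # x: (B, C, H, W) feature map; one query per pixel
    B, C, H, W = x.shape
    q = x.flatten(2).transpose(1, 2)                      # (B, H*W, C)

    # Fine-grained path: keys from a sliding local window around each pixel
    k_local = F.unfold(x, window, padding=window // 2)    # (B, C*window*window, H*W)
    k_local = k_local.view(B, C, window * window, H * W).permute(0, 3, 2, 1)  # (B, H*W, w*w, C)

    # Coarse-grained path: keys from a pooled global summary, shared by all pixels
    k_global = F.adaptive_avg_pool2d(x, pool_size).flatten(2).transpose(1, 2)  # (B, P*P, C)
    k_global = k_global.unsqueeze(1).expand(B, H * W, pool_size * pool_size, C)

    # Joint softmax over local + global keys, so the two paths compete
    k = torch.cat([k_local, k_global], dim=2)             # (B, H*W, w*w + P*P, C)
    attn = torch.einsum('bnc,bnkc->bnk', q, k) / C ** 0.5
    attn = attn.softmax(dim=-1)
    out = torch.einsum('bnk,bnkc->bnc', attn, k)          # values reuse the keys in this sketch
    return out.transpose(1, 2).reshape(B, C, H, W)

# Usage: y = pixel_focused_attention_sketch(torch.randn(2, 64, 14, 14))
```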
Aggregated attention: The authors further enhance pixel-focused attention by incorporating learnable query tokens and positional attention, creating a more diverse and effective attention mechanism called aggregated attention.
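A rough sketch of how such extra components could enter the attention computation is given below; the specific parameterisation of the learnable query token and of the positional bias is an assumption for illustration, not the paper's exact formulation.

```python
# Sketch: fold a learnable query token and a content-independent positional
# bias into the attention logits before the joint softmax. Parameter shapes
# and names are illustrative assumptions.
import torch
import torch.nn as nn

class AggregatedAttentionSketch(nn.Module):
    def __init__(self, dim, num_keys):
        super().__init__()
        self.scale = dim ** -0.5
        self.query_embed = nn.Parameter(torch.zeros(1, 1, dim))    # learnable query token, broadcast to every pixel
        self.pos_bias = nn.Parameter(torch.zeros(1, 1, num_keys))  # learnable positional attention over the key set

    def forward(self, q, k):
        # q: (B, N, C) per-pixel queries; k: (B, N, K, C) local + global keys per pixel
        q = q + self.query_embed                                    # diversify queries with a learned component
        logits = torch.einsum('bnc,bnkc->bnk', q, k) * self.scale
        logits = logits + self.pos_bias                             # content-independent positional attention
        attn = logits.softmax(dim=-1)
        return torch.einsum('bnk,bnkc->bnc', attn, k)

# Usage: AggregatedAttentionSketch(dim=64, num_keys=9 + 49)(torch.randn(2, 196, 64), torch.randn(2, 196, 58, 64))
```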
Convolutional GLU: The authors propose a new channel mixer called convolutional GLU, which integrates local feature-based channel attention to enhance model robustness. This mixer is more suitable for image tasks compared to traditional feed-forward networks.
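The sketch below shows a GLU-style channel mixer whose gating branch passes through a depthwise convolution, so the gate is conditioned on local spatial features; hidden width, kernel size, and activation are illustrative choices rather than the paper's exact configuration.

```python
# Sketch of a gated-linear-unit channel mixer with a depthwise convolution on
# the gating branch, in the spirit of convolutional GLU. Sizes are assumptions.
import torch
import torch.nn as nn

class ConvGLUSketch(nn.Module):
    def __init__(self, dim, hidden=None, kernel=3):
        super().__init__()
        hidden = hidden or dim * 2
        self.fc_in = nn.Linear(dim, hidden * 2)          # project to value and gate branches
        self.dwconv = nn.Conv2d(hidden, hidden, kernel, padding=kernel // 2, groups=hidden)
        self.act = nn.GELU()
        self.fc_out = nn.Linear(hidden, dim)

    def forward(self, x, H, W):
        # x: (B, H*W, C) token sequence from a feature map of size H x W
        B, N, _ = x.shape
        value, gate = self.fc_in(x).chunk(2, dim=-1)
        # The depthwise conv injects local spatial context into the gate,
        # acting like a local-feature-based channel attention.
        gate = gate.transpose(1, 2).reshape(B, -1, H, W)
        gate = self.dwconv(gate).flatten(2).transpose(1, 2)
        return self.fc_out(value * self.act(gate))

# Usage: ConvGLUSketch(64)(torch.randn(2, 14 * 14, 64), 14, 14)
```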
Comprehensive evaluation: The authors extensively evaluate TransNeXt on various computer vision tasks, including image classification, object detection, and semantic segmentation. The results demonstrate that TransNeXt outperforms previous state-of-the-art models across multiple benchmarks, including ImageNet-1K, COCO, and ADE20K, while also exhibiting superior robustness on challenging test sets like ImageNet-A.
Efficient multi-scale inference: The authors show that TransNeXt can perform efficient multi-scale inference, outperforming pure convolutional models in both normal and linear inference modes. This is attributed to the effective design of the aggregated attention mechanism and the use of length-scaled cosine attention and extrapolative positional encoding.
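A small sketch of the length-scaled cosine attention idea is given below: queries and keys are L2-normalised and the similarity is rescaled by a factor that grows with the log of the key count, which helps keep softmax sharpness stable when inference resolution exceeds training resolution. The learnable per-head temperature is an assumed parameterisation, not necessarily the paper's.

```python
# Sketch of length-scaled cosine attention: cosine similarity between q and k,
# rescaled by a learnable temperature times log(N) so the logits stay well
# calibrated as the number of keys N grows at higher input resolutions.
import math
import torch
import torch.nn as nn
import torch.nn.functional as F

class LengthScaledCosineAttention(nn.Module):
    def __init__(self, num_heads):
        super().__init__()
        self.tau = nn.Parameter(torch.ones(num_heads, 1, 1))  # learnable temperature per head (assumption)

    def forward(self, q, k, v):
        # q, k, v: (B, heads, N, d)
        n = k.shape[-2]
        q = F.normalize(q, dim=-1)
        k = F.normalize(k, dim=-1)
        logits = (q @ k.transpose(-2, -1)) * self.tau * math.log(n)
        return logits.softmax(dim=-1) @ v

# Usage at a resolution unseen during training:
# attn = LengthScaledCosineAttention(num_heads=4)
# out = attn(torch.randn(1, 4, 56 * 56, 32), torch.randn(1, 4, 56 * 56, 32), torch.randn(1, 4, 56 * 56, 32))
```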
Overall, the article presents a comprehensive and innovative approach to designing a robust and biomimetic visual backbone for vision transformers, which significantly advances the state-of-the-art in computer vision.
Key insights distilled from the paper by Dai Shi (arxiv.org, 04-01-2024): https://arxiv.org/pdf/2311.17132.pdf