Robust and Biomimetic Visual Perception for Vision Transformers: TransNeXt
The article proposes TransNeXt, a visual backbone that uses aggregated attention as its token mixer and a convolutional GLU as its channel mixer. The design closely mimics biological foveal vision and mitigates potential depth degradation, and the authors report state-of-the-art performance across multiple computer vision tasks.
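To make the channel-mixer idea concrete, here is a minimal NumPy sketch of a gated linear unit with a convolution in its gate branch. It is an illustration under stated assumptions, not the authors' implementation: the 3x3 depthwise kernel, the tanh-approximated GELU activation, the choice of which half of the projection acts as the gate, and all function and parameter names (`conv_glu`, `w_in`, `w_dw`, `w_out`) are hypothetical.

```python
import numpy as np

def gelu(x):
    # tanh approximation of GELU (assumed activation for this sketch)
    return 0.5 * x * (1.0 + np.tanh(np.sqrt(2.0 / np.pi) * (x + 0.044715 * x**3)))

def depthwise_conv3x3(x, kernel):
    # x: (H, W, C), kernel: (3, 3, C); zero padding, stride 1.
    # Each channel is filtered independently by its own 3x3 kernel.
    H, W, _ = x.shape
    padded = np.pad(x, ((1, 1), (1, 1), (0, 0)))
    out = np.zeros_like(x)
    for dy in range(3):
        for dx in range(3):
            out += padded[dy:dy + H, dx:dx + W, :] * kernel[dy, dx, :]
    return out

def conv_glu(x, w_in, w_dw, w_out):
    # x: (H, W, C) feature map; w_in: (C, 2*D); w_dw: (3, 3, D); w_out: (D, C)
    hidden = x @ w_in                           # project to two branches, (H, W, 2*D)
    gate, value = np.split(hidden, 2, axis=-1)  # each branch is (H, W, D)
    gate = gelu(depthwise_conv3x3(gate, w_dw))  # gate sees a 3x3 neighbourhood
    return (gate * value) @ w_out               # gated value, projected back to C

# Example with arbitrary small sizes and random weights
rng = np.random.default_rng(0)
x = rng.standard_normal((4, 4, 8))
w_in = rng.standard_normal((8, 32))
w_dw = rng.standard_normal((3, 3, 16))
w_out = rng.standard_normal((16, 8))
y = conv_glu(x, w_in, w_dw, w_out)
print(y.shape)
```

In this sketch the value branch stays purely per-token, while the gate for each token depends on its immediate spatial neighbours, so the channel mixer gains a small amount of local context without any attention computation.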