ViT-CoMer introduces a plain, pre-training-free, feature-enhanced ViT backbone with Convolutional Multi-scale feature interaction. It addresses the limitations of plain ViTs in dense prediction tasks by injecting spatial-pyramid convolutional features and applying bidirectional fusion between the CNN and transformer branches. The model achieves competitive results on the COCO val2017 and ADE20K val benchmarks without extra training data.
The paper attributes the difficulty ViTs face in dense prediction to limited local information interaction and single-scale feature representation. ViT-CoMer tackles both with a multi-receptive-field feature pyramid module (MRFP) and a CNN-Transformer bidirectional fusion interaction module (CTI). Because the plain ViT backbone itself is left unchanged, ViT-CoMer can directly load open-source pre-trained ViT weights and improves performance across a range of dense prediction benchmarks.
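To make the two module names concrete, here is a minimal PyTorch sketch of how such components could be wired. The names MRFP and CTI come from the paper, but the internal layer choices (depthwise dilated convolutions, cross-attention fusion, channel sizes) are illustrative assumptions, not the authors' implementation.

```python
import torch
import torch.nn as nn

class MRFP(nn.Module):
    """Sketch of a multi-receptive-field feature pyramid: parallel depthwise
    convs with different dilation rates, merged by a 1x1 projection."""
    def __init__(self, dim, dilations=(1, 2, 3)):
        super().__init__()
        self.branches = nn.ModuleList(
            nn.Conv2d(dim, dim, 3, padding=d, dilation=d, groups=dim)
            for d in dilations
        )
        self.proj = nn.Conv2d(dim * len(dilations), dim, 1)

    def forward(self, x):  # x: (B, C, H, W)
        return self.proj(torch.cat([b(x) for b in self.branches], dim=1))

class CTI(nn.Module):
    """Sketch of CNN-Transformer bidirectional interaction: exchange
    information between a CNN feature map and ViT tokens via cross-attention."""
    def __init__(self, dim, num_heads=8):
        super().__init__()
        self.cnn_to_vit = nn.MultiheadAttention(dim, num_heads, batch_first=True)
        self.vit_to_cnn = nn.MultiheadAttention(dim, num_heads, batch_first=True)

    def forward(self, vit_tokens, cnn_feat):
        # vit_tokens: (B, N, C); cnn_feat: (B, C, H, W)
        B, C, H, W = cnn_feat.shape
        cnn_tokens = cnn_feat.flatten(2).transpose(1, 2)  # (B, H*W, C)
        # ViT tokens query CNN features (inject local, multi-scale detail) ...
        vit_tokens = vit_tokens + self.cnn_to_vit(vit_tokens, cnn_tokens, cnn_tokens)[0]
        # ... then CNN features query the updated tokens (inject global context).
        cnn_tokens = cnn_tokens + self.vit_to_cnn(cnn_tokens, vit_tokens, vit_tokens)[0]
        return vit_tokens, cnn_tokens.transpose(1, 2).reshape(B, C, H, W)

# Hypothetical shapes, for illustration only:
mrfp, cti = MRFP(dim=64), CTI(dim=64, num_heads=8)
vit_tokens = torch.randn(2, 256, 64)       # 16x16 patches, 64-dim tokens
cnn_feat = torch.randn(2, 64, 16, 16)      # matching CNN feature map
vit_tokens, cnn_feat = cti(vit_tokens, mrfp(cnn_feat))
```

The key idea the sketch captures is that fusion runs in both directions: the convolutional branch supplies local multi-scale cues to the ViT tokens, while the ViT tokens feed global context back to the CNN features.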
The study compares ViT-CoMer with existing backbones such as Swin, PVT, and MixFormer on object detection, instance segmentation, and semantic segmentation. Under similar model sizes and training configurations, ViT-CoMer outperforms the other backbones, and ablation studies confirm that both the MRFP and CTI modules contribute to enhancing plain ViTs for dense prediction.
Further experiments show that the approach also generalizes to hierarchical vision transformers such as Swin-T. Qualitative results illustrate the model's ability to capture fine-grained multi-scale features for more precise object localization. Overall, ViT-CoMer is a promising backbone for dense prediction tasks in computer vision.
Source: Chunlong Xia et al., arXiv (2024-03-13), https://arxiv.org/pdf/2403.07392.pdf