Key Concepts
ViT-CoMer enhances the ViT backbone by integrating multi-scale convolutional features, improving performance in dense prediction tasks.
Summary
ViT-CoMer introduces a plain, pre-training-free, feature-enhanced ViT backbone with Convolutional Multi-scale feature interaction. It addresses ViT's limitations in dense prediction tasks by incorporating spatial-pyramid convolutional features and bidirectional fusion between CNN and Transformer features. The model achieves competitive results on COCO val2017 and ADE20K val without extra training data.
The paper attributes the difficulty Vision Transformers (ViTs) have with dense prediction to limited local information interaction and single-scale feature representation. ViT-CoMer addresses both with a multi-receptive-field feature pyramid module (MRFP) and a CNN-Transformer bidirectional fusion interaction module (CTI). Because the ViT backbone itself is left unchanged, ViT-CoMer can directly load open-source pre-trained ViT weights, and it demonstrates improved performance across various dense prediction benchmarks.
The study compares ViT-CoMer with existing backbones such as Swin, PVT, and MixFormer on object detection, instance segmentation, and semantic segmentation. Under comparable model sizes and training configurations, ViT-CoMer outperforms these alternatives, and ablation studies confirm that both the MRFP and CTI modules contribute to enhancing plain ViTs for dense prediction.
Further experiments demonstrate the generality of ViT-CoMer by applying it to hierarchical vision transformers such as Swin-T. Qualitative results show that the model captures fine-grained multi-scale features that improve object localization. Overall, ViT-CoMer proves to be a promising backbone for dense prediction tasks in computer vision.
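As a rough illustration of the data flow described above (not the authors' implementation: the actual MRFP uses multi-receptive-field convolutions and CTI uses learned fusion blocks), the sketch below stands in average pooling, nearest-neighbour upsampling, and element-wise addition for the learned operations, just to make the pyramid construction and the two fusion directions concrete.

```python
def avg_pool2x2(fm):
    """Halve a 2D feature map (list of lists) with 2x2 average pooling."""
    h, w = len(fm), len(fm[0])
    return [[(fm[2*i][2*j] + fm[2*i][2*j+1]
              + fm[2*i+1][2*j] + fm[2*i+1][2*j+1]) / 4.0
             for j in range(w // 2)]
            for i in range(h // 2)]

def upsample2x(fm):
    """Double a 2D feature map with nearest-neighbour upsampling."""
    out = []
    for row in fm:
        wide = [v for v in row for _ in range(2)]
        out.append(wide)
        out.append(list(wide))
    return out

def build_pyramid(fm, levels=3):
    """MRFP stand-in: one feature map -> progressively coarser scales."""
    pyramid = [fm]
    for _ in range(levels - 1):
        pyramid.append(avg_pool2x2(pyramid[-1]))
    return pyramid

def fuse(a, b):
    """Element-wise sum of two same-sized maps (stand-in for learned fusion)."""
    return [[x + y for x, y in zip(ra, rb)] for ra, rb in zip(a, b)]

def bidirectional_fuse(cnn_maps, vit_map):
    """CTI stand-in: CNN -> ViT injection, then ViT -> CNN refinement."""
    # CNN -> ViT: bring every pyramid level to the ViT resolution and add.
    enriched = vit_map
    for level, fm in enumerate(cnn_maps):
        for _ in range(level):
            fm = upsample2x(fm)
        enriched = fuse(enriched, fm)
    # ViT -> CNN: push the enriched map back down through the pyramid.
    refined, fm = [], enriched
    for level in range(len(cnn_maps)):
        refined.append(fuse(cnn_maps[level], fm))
        fm = avg_pool2x2(fm)
    return enriched, refined
```

Even in this toy form, the two directions mirror the paper's idea: the ViT branch gains multi-scale spatial detail from the CNN pyramid, while each pyramid level is refined with the ViT's global context.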
Statistics
Our ViT-CoMer-L achieves 64.3% AP on COCO val2017 without extra training data.
Our ViT-CoMer-L attains 62.1% mIoU on ADE20K val.
The proposed MRFP module provides rich multi-scale information.
CTI fuses multi-scale features from CNN and Transformer.
N=4 bidirectional fusion interaction modules perform best.
Quotes
"Our main contributions are proposing a novel dense prediction backbone by combining plain ViT with CNN features."
"We evaluate our proposed ViT-CoMer on several challenging dense prediction benchmarks."
"Notably, our approach can achieve superior performance compared to both plain and adapted backbones."