Basic Concepts
This paper introduces V2M, a novel image representation learning framework that processes images with a 2-dimensional state space model (2D SSM). By capturing local spatial dependencies directly in 2D rather than flattening images into token sequences, V2M improves performance on image classification and downstream vision tasks over existing methods that rely on 1D SSMs.
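To make the contrast with 1D scans concrete, below is a minimal NumPy sketch of a generic 2D state-space recurrence, where each hidden state mixes the states from the pixel above and the pixel to the left before producing an output. The matrices A1, A2, B, C and the single top-left-to-bottom-right scan order are illustrative assumptions for exposition, not V2M's exact parameterization.

```python
import numpy as np

def ssm_2d_scan(x, A1, A2, B, C):
    """Toy 2D state-space scan over an image feature map.

    x: (H, W, d_in) input features. The hidden state h[i, j] aggregates
    information from the pixel above (via A1) and to the left (via A2),
    so each output depends on the entire top-left quadrant, instead of
    only the preceding tokens of a flattened 1D sequence.
    """
    H, W, _ = x.shape
    d_state = A1.shape[0]
    h = np.zeros((H, W, d_state))
    y = np.zeros((H, W, C.shape[0]))
    for i in range(H):
        for j in range(W):
            h_up = h[i - 1, j] if i > 0 else np.zeros(d_state)
            h_left = h[i, j - 1] if j > 0 else np.zeros(d_state)
            # 2D recurrence: two directional state transitions plus input
            h[i, j] = A1 @ h_up + A2 @ h_left + B @ x[i, j]
            y[i, j] = C @ h[i, j]
    return y

# Tiny usage example on random features (all shapes are illustrative)
rng = np.random.default_rng(0)
d_in, d_state, d_out = 4, 8, 4
x = rng.standard_normal((6, 6, d_in))
A1 = 0.4 * np.eye(d_state)  # vertical transition (toy values)
A2 = 0.4 * np.eye(d_state)  # horizontal transition (toy values)
B = 0.1 * rng.standard_normal((d_state, d_in))
C = 0.1 * rng.standard_normal((d_out, d_state))
print(ssm_2d_scan(x, A1, A2, B, C).shape)  # (6, 6, 4)
```

The nested loop is written for clarity, not speed; practical 2D SSM layers compute this recurrence with parallel scan or convolutional reformulations.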
Statistics
V2M-T achieves a 6.4% increase in Top-1 accuracy compared to ResNet-18 and a 4.0% improvement over DeiT-T on ImageNet.
V2M-S* outperforms ResNet-50, RegNetY-4G, and DeiT-S, with respective increases of 5.7%, 2.9%, and 3.0% in Top-1 accuracy on ImageNet.
V2M-S* outperforms VMamba by 0.3 box AP and 0.2 mask AP under the 1× schedule on COCO object detection and instance segmentation.
V2M-S* surpasses the VMamba-T baseline by 0.3 single-scale (SS) mIoU on ADE20K semantic segmentation.