Recent state space models, particularly Mamba, have made notable progress in long-sequence modeling, but in vision tasks they still lag behind CNNs and ViTs. The paper argues that the key to improving Vision Mamba lies in optimizing the scan directions used for sequence modeling. Its local scanning strategy divides the image into distinct windows, so that local dependencies are captured while a global perspective is maintained. Because the preferred scan pattern can differ from layer to layer, a dynamic method searches for the optimal scan choice for each layer independently, which substantially improves performance. The resulting models, LocalVim and LocalVMamba, outperform previous models on ImageNet; for example, the reported gain over Vim-Ti is 3.1% at the same FLOPs.
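The windowed scan can be illustrated with a small index computation. The following is a minimal sketch, not the authors' implementation: the function name `local_scan_order`, the row-major ordering inside each window, and the requirement that the image size be divisible by the window size are all assumptions made for illustration. It reorders a flattened grid of tokens so that every window is visited contiguously before the scan moves to the next window.

```python
import numpy as np


def local_scan_order(height, width, window):
    """Return flat token indices that visit each window x window tile contiguously.

    Hypothetical helper for illustration only; assumes row-major order both
    across tiles and inside each tile.
    """
    if height % window or width % window:
        raise ValueError("height and width must be divisible by window")
    idx = np.arange(height * width).reshape(height, width)
    # Axes after reshape: (tile row, row inside tile, tile column, column inside tile).
    tiles = idx.reshape(height // window, window, width // window, window)
    # Bring the two tile axes to the front so each tile becomes one contiguous run.
    tiles = tiles.transpose(0, 2, 1, 3)
    return tiles.reshape(-1)


# A 4x4 token grid scanned in 2x2 windows.
order = local_scan_order(4, 4, 2)
print(order.tolist())

# The inverse permutation restores the original row-major token order
# after the sequence model has processed the reordered tokens.
inverse = np.argsort(order)
tokens = np.arange(16) * 10
assert np.array_equal(tokens[order][inverse], tokens)
```

With this ordering, neighbouring tokens inside a window stay adjacent in the 1D sequence, whereas a plain row-major flatten places vertically adjacent tokens a full image row apart.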
Key Insights Distilled From
by Tao Huang, Xi... at arxiv.org 03-15-2024
https://arxiv.org/pdf/2403.09338.pdf