Recent advances in state space models, particularly Mamba, have shown strong progress in long-sequence modeling, yet their application to vision tasks still lags behind CNNs and Vision Transformers (ViTs). The key to improving Vision Mamba lies in the choice of scan directions for sequence modeling: a local scanning strategy divides the image into windows so that tokens within each window are scanned consecutively, capturing local dependencies while still retaining a global perspective. In addition, a dynamic method searches for the optimal combination of scan choices independently for each layer, which substantially improves performance. The resulting models, LocalVim and LocalVMamba, outperform previous Vision Mamba counterparts on ImageNet; notably, the gain over Vim-Ti is 3.1% top-1 accuracy at the same FLOPs.
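To make the windowed scanning idea concrete, below is a minimal sketch (not the authors' implementation) of how a row-major patch sequence can be re-ordered so that tokens inside each k×k window are visited consecutively before the scan moves on to the next window. The function name `local_scan_order` and the 2×2 window size are assumptions made for this example.

```python
# Illustrative sketch of a window-wise "local scan" ordering for patch tokens.
# Assumption: tokens are stored row-major as a (B, H*W, C) tensor.
import torch


def local_scan_order(H: int, W: int, k: int = 2) -> torch.Tensor:
    """Return permutation indices that turn row-major order into window-wise order."""
    assert H % k == 0 and W % k == 0, "grid must be divisible by the window size"
    idx = torch.arange(H * W).reshape(H, W)
    # Split the grid into (H//k, W//k) windows of size k x k, then flatten window by window.
    idx = idx.reshape(H // k, k, W // k, k).permute(0, 2, 1, 3).reshape(-1)
    return idx


if __name__ == "__main__":
    B, H, W, C = 2, 4, 4, 8
    tokens = torch.randn(B, H * W, C)      # row-major (global) patch sequence
    order = local_scan_order(H, W, k=2)    # window-wise (local) visiting order
    local_tokens = tokens[:, order, :]     # sequence fed to the selective scan
    inverse = torch.argsort(order)         # inverse permutation
    assert torch.equal(local_tokens[:, inverse, :], tokens)
```

The inverse permutation restores the original spatial layout after the scan, so the output of a local-scan branch can be merged with branches that use other scan directions.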
Source: arxiv.org