Core Concepts
Samba, a novel semantic segmentation framework built on the Mamba architecture, effectively captures global semantic information in high-resolution remotely sensed images with low computational complexity, outperforming state-of-the-art CNN and ViT-based methods.
Summary
The article introduces Samba, a semantic segmentation framework for high-resolution remotely sensed images, which is built upon the Mamba architecture. The key highlights are:
- Limitations of existing methods:
  - CNN-based methods struggle with a limited receptive field when handling high-resolution images.
  - ViT-based methods suffer from the quadratic cost of self-attention over long patch sequences and require large amounts of training data (a back-of-the-envelope cost comparison follows this list).
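For intuition, the token-mixing costs behind these two trade-offs can be written out. This is the standard asymptotic comparison, not a figure taken from the article:

```latex
% Cost of mixing a sequence of N patch tokens with embedding dimension d,
% where N = HW / p^2 for an H x W image and patch size p:
\text{self-attention (ViT):}\quad \mathcal{O}(N^2 d)
\qquad\qquad
\text{selective SSM (Mamba):}\quad \mathcal{O}(N d)
```

For example, a 1024×1024 image with 16×16 patches yields N = (1024/16)² = 4096 tokens, so the quadratic term grows rapidly with resolution, which is what motivates a linear-time token mixer.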
- Samba architecture:
  - Samba uses an encoder-decoder architecture, with Samba blocks as the encoder and UperNet as the decoder.
  - The Samba block replaces the multi-head self-attention of a ViT block with a Mamba block, which efficiently captures global semantic information using a State Space Model (SSM).
  - Combining the Mamba block with an MLP enhances the model's representational capacity and its ability to learn complex data (see the structural sketch after this list).
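A minimal PyTorch sketch of this block layout, assuming the usual pre-norm residual structure of a ViT block with the attention swapped for a Mamba layer. `SambaBlock` and its dimensions are illustrative names rather than the authors' code; the `Mamba` layer comes from the open-source `mamba_ssm` package:

```python
import torch
import torch.nn as nn
from mamba_ssm import Mamba  # selective SSM layer from the mamba-ssm package


class SambaBlock(nn.Module):
    """ViT-style block with multi-head self-attention replaced by Mamba.

    A sketch of the layout described above, not the reference implementation.
    """

    def __init__(self, dim: int, mlp_ratio: float = 4.0):
        super().__init__()
        self.norm1 = nn.LayerNorm(dim)
        self.mixer = Mamba(d_model=dim)  # global token mixing, linear in sequence length
        self.norm2 = nn.LayerNorm(dim)
        self.mlp = nn.Sequential(        # channel mixing, as in a standard ViT block
            nn.Linear(dim, int(dim * mlp_ratio)),
            nn.GELU(),
            nn.Linear(int(dim * mlp_ratio), dim),
        )

    def forward(self, x: torch.Tensor) -> torch.Tensor:
        # x: (batch, num_patches, dim) -- a flattened sequence of patch embeddings
        x = x + self.mixer(self.norm1(x))  # SSM scan in place of self-attention
        x = x + self.mlp(self.norm2(x))
        return x
```

Stacking such blocks yields the encoder; its multi-scale feature maps would then feed the UperNet decoder head.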
- Experiments and results:
  - Samba is evaluated on LoveDA, a dataset of high-resolution remotely sensed imagery.
  - Without using pre-trained parameters, Samba outperforms top-performing CNN-based (ConvNeXt, ResNet50) and ViT-based (Swin-T) methods in segmentation accuracy (mIoU).
  - Samba sets a new performance benchmark for Mamba-based techniques in semantic segmentation of remotely sensed images.
- Potential future directions:
  - Combining Mamba with CNNs to strengthen the capture of local details.
  - Exploring efficient and effective transfer learning methods tailored to the Mamba architecture.
  - Applying Mamba-based methods to semantic segmentation of multi-channel data, such as hyperspectral imagery.
Statistics
This summary does not reproduce specific numerical results; the article presents its quantitative and illustrative material in the following figures and tables:
Figure 1(a) illustrates the limited receptive field of CNNs, which grows to only 7×7 after stacking a few 3×3 convolutions.
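As a side note on the arithmetic (standard receptive-field algebra, not taken from the article), the receptive field after layer n with kernel size k_n and per-layer strides s_i is:

```latex
RF_n = RF_{n-1} + (k_n - 1)\prod_{i=1}^{n-1} s_i, \qquad RF_0 = 1
```

So two stride-1 3×3 convolutions give a 5×5 field, while 7×7 requires, e.g., three stride-1 3×3 layers (1 + 2 + 2 + 2) or a stride-2 first layer (1 + 2 + 2·2). Either way, the growth is slow relative to image size, which is the figure's point.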
Figure 1(b) shows how ViT slices an image into patches and applies multi-head self-attention to obtain a global receptive field.
Table 1 summarizes the training settings for the compared semantic segmentation networks, including decoder, encoder, image size, total training iterations, batch size, optimizer, initial learning rate, warmup iterations, learning rate schedule, weight decay, loss function, and data augmentation.
Table 2 presents the performance comparison of Samba and the other methods on the LoveDA dataset, including mIoU, FLOPs per patch, and number of parameters.
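For reference, mIoU, the accuracy metric cited above, is the class-averaged intersection-over-union; this is the standard definition rather than a formula quoted from the article:

```latex
\mathrm{mIoU} = \frac{1}{C} \sum_{c=1}^{C} \frac{TP_c}{TP_c + FP_c + FN_c}
```

Here C is the number of classes, and TP_c, FP_c, FN_c count the pixels of class c that are correctly labeled, falsely predicted, and missed, respectively.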
Quotes
The article does not contain any direct quotes that are particularly striking or that support its key arguments.