Core Concepts
ViM-UNet, a novel segmentation architecture based on the Vision Mamba (ViM) architecture, performs similarly to or better than the popular UNet model and outperforms the transformer-based UNETR model, while being more computationally efficient than UNETR.
Abstract
The paper introduces ViM-UNet, a novel segmentation architecture based on the Vision Mamba (ViM) architecture, and compares its performance to the UNet and UNETR models on two challenging microscopy instance segmentation tasks.
Key highlights:
UNet is the default architecture for biomedical segmentation, while transformer-based approaches like UNETR have been proposed to provide a global field of view.
However, transformer-based models suffer from longer runtimes and higher parameter counts.
The recently proposed Vision Mamba (ViM) architecture offers a compelling alternative to transformers, providing a global field of view at higher efficiency.
The authors introduce ViM-UNet, a novel segmentation architecture based on ViM, and compare it to UNet and UNETR for cell segmentation in phase-contrast microscopy (LIVECell) and neurite segmentation in volume electron microscopy (CREMI).
For the LIVECell dataset, ViM-UNet performs similarly to UNet, while UNETR underperforms. For the CREMI dataset, ViM-UNet outperforms both UNet and UNETR.
The authors also analyze the inference times and memory requirements, finding that ViM-UNet is more efficient than UNETR.
The results suggest that ViM-UNet is a promising architecture for biomedical image analysis, especially for applications where a large context is important, such as 3D segmentation or cell tracking.
Stats
The parameter counts of the models are:
UNet: 28M
UNETRBase: 113M
UNETRLarge: 334M
UNETRHuge: 665M
ViM-UNetTiny: 18M
ViM-UNetSmall: 39M
The VRAM required to train each model is:
UNet: ≤4GB
UNETRBase: ≤24GB
UNETRLarge: ≤38GB
UNETRHuge: ≤48GB
ViM-UNetTiny: ≤9GB
ViM-UNetSmall: ≤10GB
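To make the two lists above easier to compare, the following minimal sketch expresses each model's parameter count and training VRAM relative to UNet. The numbers are copied from the lists; the dictionary names and the helper `relative_to` are hypothetical and introduced only for illustration. Note that the VRAM figures are upper bounds, so the VRAM ratios are rough indications rather than exact measurements.

```python
# Parameter counts in millions, copied from the list above.
params_m = {
    "UNet": 28,
    "UNETRBase": 113,
    "UNETRLarge": 334,
    "UNETRHuge": 665,
    "ViM-UNetTiny": 18,
    "ViM-UNetSmall": 39,
}

# Training VRAM upper bounds in GB, copied from the list above.
vram_gb = {
    "UNet": 4,
    "UNETRBase": 24,
    "UNETRLarge": 38,
    "UNETRHuge": 48,
    "ViM-UNetTiny": 9,
    "ViM-UNetSmall": 10,
}


def relative_to(table, baseline="UNet"):
    """Return every entry of `table` divided by the baseline entry."""
    base = table[baseline]
    return {name: value / base for name, value in table.items()}


param_ratio = relative_to(params_m)
vram_ratio = relative_to(vram_gb)

for name in params_m:
    print(f"{name:14s} params x{param_ratio[name]:.2f}  VRAM x{vram_ratio[name]:.2f}")
```

Read this way, ViM-UNetTiny has fewer parameters than UNet but a higher VRAM bound, while every UNETR variant is larger than UNet on both counts.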
The inference times per image (in seconds) are:
LIVECell:
UNet: 0.02 (1.2e-4)
UNETRBase: 0.15 (3.3e-4)
UNETRLarge: 0.32 (4.9e-4)
UNETRHuge: 0.54 (4.6e-4)
ViM-UNetTiny: 0.05 (2.7e-3)
ViM-UNetSmall: 0.05 (4.6e-3)
CREMI:
UNet: 0.30 (1.8e-2)
UNETRBase: 1.37 (1.8e-2)
UNETRLarge: 2.95 (3.4e-2)
UNETRHuge: 4.86 (3.8e-2)
ViM-UNetTiny: 0.74 (3e-2)
ViM-UNetSmall: 0.82 (2.4e-2)
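The efficiency claim can be checked directly against the timings above. The sketch below computes how much faster one model is than another on each dataset, using only the first number of each entry (the values in parentheses are left out). The dictionary `inference_s` and the helper `speedup` are hypothetical names used for illustration.

```python
# Inference time per image in seconds, copied from the lists above.
# The parenthesized values from the source are intentionally omitted.
inference_s = {
    "LIVECell": {
        "UNet": 0.02,
        "UNETRBase": 0.15,
        "UNETRLarge": 0.32,
        "UNETRHuge": 0.54,
        "ViM-UNetTiny": 0.05,
        "ViM-UNetSmall": 0.05,
    },
    "CREMI": {
        "UNet": 0.30,
        "UNETRBase": 1.37,
        "UNETRLarge": 2.95,
        "UNETRHuge": 4.86,
        "ViM-UNetTiny": 0.74,
        "ViM-UNetSmall": 0.82,
    },
}


def speedup(dataset, model, reference):
    """How many times faster `model` is than `reference` (values below 1 mean slower)."""
    times = inference_s[dataset]
    return times[reference] / times[model]


for dataset in inference_s:
    for reference in ("UNet", "UNETRBase"):
        factor = speedup(dataset, "ViM-UNetSmall", reference)
        print(f"{dataset}: ViM-UNetSmall vs {reference}: x{factor:.2f}")
```

With these figures, ViM-UNetSmall is roughly 3 times faster than UNETRBase on LIVECell and roughly 1.7 times faster on CREMI, while remaining slower than UNet on both datasets.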