This work presents an extensive robustness analysis of Visual Foundation Models (VFMs) and state-of-the-art unimodal segmentation models under real-world-inspired perturbations. The authors introduce two benchmark datasets, MS COCO-P and ADE20K-P, created by applying 17 corruption types at 5 severity levels to the original MS COCO and ADE20K images.
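To make the benchmark construction concrete, here is a minimal sketch of applying corruptions at increasing severity. This is a hypothetical illustration, not the authors' pipeline: the real benchmarks use 17 corruption types, while only two simple stand-ins (Gaussian noise and color quantization, a crude proxy for compression artifacts) are shown here.

```python
import numpy as np

def corrupt(image: np.ndarray, corruption: str, severity: int) -> np.ndarray:
    """Apply an illustrative corruption at severity 1-5 (hypothetical helper)."""
    assert 1 <= severity <= 5
    rng = np.random.default_rng(0)
    img = image.astype(np.float64)
    if corruption == "gaussian_noise":
        # Noise standard deviation grows with severity.
        sigma = [8, 16, 24, 32, 48][severity - 1]
        img = img + rng.normal(0.0, sigma, img.shape)
    elif corruption == "quantization":
        # Coarser color quantization loosely mimics compression artifacts.
        levels = [64, 32, 16, 8, 4][severity - 1]
        step = 256 / levels
        img = np.floor(img / step) * step + step / 2
    else:
        raise ValueError(f"unknown corruption: {corruption}")
    return np.clip(img, 0, 255).astype(np.uint8)

# Corrupt a synthetic image at two severity levels.
clean = np.random.default_rng(1).integers(0, 256, (64, 64, 3), dtype=np.uint8)
noisy = corrupt(clean, "gaussian_noise", severity=5)
blocky = corrupt(clean, "quantization", severity=3)
```

In the benchmark each corruption is applied at all five severities, yielding a grid of perturbed copies per image against which segmentation quality is measured.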
The key findings are:
VFMs are particularly vulnerable to blur- and compression-based corruptions.
While multimodal VFMs are not typically more robust or higher-performing than unimodal models, they show competitive robustness in zero-shot scenarios, maintaining consistent performance across different corruption categories.
Certain object categories, such as those in "appliance", "furniture", "outdoor", and "sports", show enhanced relative robustness for multimodal VFMs compared to unimodal models.
The authors observe that models which encode corrupted images closer to the original in the latent space tend to be more robust.
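The latent-space observation above can be sketched as a simple diagnostic: measure how far an encoder's embedding of a corrupted image drifts from the embedding of the clean original. This is an illustrative sketch, not the authors' method; `latent_shift`, the toy random-projection "encoder", and all shapes are assumptions for demonstration.

```python
import numpy as np

def latent_shift(embed, clean_batch, corrupt_batch):
    """Mean cosine distance between embeddings of clean images and their
    corrupted counterparts; a smaller shift suggests a more robust encoder.
    `embed` is any callable mapping an image batch to feature vectors."""
    z_clean = embed(clean_batch)
    z_corr = embed(corrupt_batch)
    # Normalize rows, then cosine distance = 1 - cosine similarity.
    z_clean = z_clean / np.linalg.norm(z_clean, axis=1, keepdims=True)
    z_corr = z_corr / np.linalg.norm(z_corr, axis=1, keepdims=True)
    return float(np.mean(1.0 - np.sum(z_clean * z_corr, axis=1)))

rng = np.random.default_rng(0)
W = rng.normal(size=(64 * 64 * 3, 128))  # toy "encoder": random linear projection
embed = lambda batch: batch.reshape(len(batch), -1).astype(np.float64) @ W

clean = rng.integers(0, 256, (4, 64, 64, 3), dtype=np.uint8)
noisy = np.clip(clean + rng.normal(0, 32, clean.shape), 0, 255).astype(np.uint8)
shift = latent_shift(embed, clean, noisy)
```

Comparing this shift across models, lower values would track the paper's observation that encoders mapping corrupted inputs near their clean originals tend to be more robust.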
The authors hope these findings and the benchmark dataset will encourage further advancements in building more robust foundational segmentation models.
Key insights distilled from https://arxiv.org/pdf/2306.09278.pdf by Madeline Cha... at arxiv.org, 04-30-2024.