Key Concepts
ASAM is a framework that uses adversarial tuning to significantly improve the performance of the Segment Anything Model (SAM) across a diverse range of segmentation tasks, without requiring substantial additional data or architectural changes.
Summary
The paper introduces ASAM, a novel framework that aims to boost the generalization capabilities of the Segment Anything Model (SAM), a pioneering visual foundation model for image segmentation.
The key insights are:
- Inspired by the success of adversarial training in natural language processing, the authors propose fine-tuning SAM on "natural" adversarial examples generated with a Stable Diffusion model.
- To create these examples, the authors project natural images onto a low-dimensional manifold using the Stable Diffusion model, then optimize the latent representation so that the resulting adversarial images remain photorealistic and aligned with the original mask annotations.
- The authors integrate a ControlNet module into the diffusion process to further enhance the spatial alignment between the generated adversarial examples and their corresponding mask labels.
- By fine-tuning only a small subset of SAM's parameters with this approach, the authors achieve significant performance improvements across a diverse range of segmentation datasets and tasks, without compromising SAM's inherent generalization capabilities.
- The results demonstrate that ASAM outperforms the original SAM as well as other fine-tuning approaches, establishing new benchmarks in segmentation tasks.
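The latent-space search described in the list above can be illustrated with a toy sketch. This is not the authors' implementation: `decode` and `seg_loss` are hypothetical stand-ins for the Stable Diffusion decoder and SAM's segmentation loss, the signed-gradient ascent with projection (PGD-style) is a swapped-in optimizer, and the gradient is estimated with finite differences rather than backpropagation. The budget `eps`, step size, and iteration count are assumed values chosen only for illustration.

```python
import math

def decode(z):
    # Hypothetical stand-in for a generative decoder (NOT Stable Diffusion):
    # maps a 2-D latent to a 3-"pixel" image with values in (-1, 1).
    return [math.tanh(z[0] + 0.5 * z[1]),
            math.tanh(z[1] - 0.5 * z[0]),
            math.tanh(z[0] * z[1])]

def seg_loss(image, mask):
    # Hypothetical stand-in for a segmentation loss:
    # mean squared error between the "prediction" and the mask.
    return sum((p - m) ** 2 for p, m in zip(image, mask)) / len(mask)

def numeric_grad(f, z, h=1e-5):
    # Central finite-difference estimate of the gradient of f at z.
    grad = []
    for i in range(len(z)):
        z_plus, z_minus = list(z), list(z)
        z_plus[i] += h
        z_minus[i] -= h
        grad.append((f(z_plus) - f(z_minus)) / (2 * h))
    return grad

def sign(x):
    return (x > 0) - (x < 0)

def adversarial_latent(z0, mask, eps=0.1, step=0.02, iters=20):
    # Search for a latent near z0 whose decoded image maximizes the loss.
    # The mask is held fixed, so the label stays that of the original image.
    objective = lambda z: seg_loss(decode(z), mask)
    z = list(z0)
    best_z, best_loss = list(z0), objective(z0)
    for _ in range(iters):
        grad = numeric_grad(objective, z)
        # Signed-gradient ascent step on the latent, not on the pixels.
        z = [zi + step * sign(gi) for zi, gi in zip(z, grad)]
        # Project back into the eps-box around the original latent.
        z = [min(max(zi, ci - eps), ci + eps) for zi, ci in zip(z, z0)]
        loss = objective(z)
        if loss > best_loss:
            best_z, best_loss = list(z), loss
    return best_z, best_loss

z0 = [0.3, -0.2]          # latent of the "clean" image (toy values)
mask = [1.0, 0.0, 0.0]    # ground-truth mask (toy values)
clean_loss = seg_loss(decode(z0), mask)
z_adv, adv_loss = adversarial_latent(z0, mask)
print(clean_loss, adv_loss)
```

Perturbing the latent rather than the pixels is what keeps the output on the decoder's image manifold; in this toy setting the small box around `z0` plays the role of the constraint that keeps the adversarial image close to the original.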
Statistics
ASAM achieves an average mIoU of 77.6% across 14 diverse segmentation datasets, outperforming the original SAM by 1.3 percentage points.
ASAM surpasses the original SAM's performance on all 14 test datasets.
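For readers unfamiliar with the metric, mIoU (mean intersection over union) can be sketched as follows. This is a minimal illustration on flattened binary masks; the helper names are hypothetical, and the averaging shown here (a plain mean over mask pairs) is an assumption, since the summary does not state the exact evaluation protocol.

```python
def iou(pred, target):
    # Intersection over union of two flattened binary masks.
    inter = sum(1 for p, t in zip(pred, target) if p and t)
    union = sum(1 for p, t in zip(pred, target) if p or t)
    # Convention assumed here: two empty masks count as a perfect match.
    return inter / union if union else 1.0

def mean_iou(pairs):
    # Plain average of per-pair IoU scores.
    scores = [iou(pred, target) for pred, target in pairs]
    return sum(scores) / len(scores)

pairs = [
    ([1, 1, 0, 0], [1, 1, 0, 0]),  # perfect overlap, IoU = 1.0
    ([1, 1, 0, 0], [1, 0, 0, 0]),  # one extra predicted pixel, IoU = 0.5
]
print(mean_iou(pairs))  # 0.75
```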
Quotes
"Drawing inspiration from the successes in NLP, we introduce a novel framework, termed adversarial tuning, aimed at enhancing the generalization abilities of visual foundation models like SAM."
"By projecting natural images onto a low-dimensional manifold using a generative model, we generate adversarial examples that are both natural and photorealistic."
"Leveraging our approach, we fine-tune SAM with 'natural' adversarial examples, derived from just 1% of the SA-1B dataset, resulting in an enhanced version termed ASAM."