Core Concepts
We propose SegNext, a next-generation interactive image segmentation approach that offers low latency, high quality, and diverse prompt support by incorporating dense representation and fusion of visual prompts into generalist models.
Abstract
The paper addresses the challenge of designing an interactive image segmentation architecture that simultaneously achieves low latency, high quality, and diverse prompt support. The authors observe that existing specialist and generalist models each fall short of this goal in different ways.
Specialist models, such as FocalClick and SimpleClick, suffer high latency because they jointly encode the image and the visual prompts, requiring a full forward pass at every interaction. Generalist models, exemplified by the Segment Anything Model (SAM), have largely resolved the latency and prompt-diversity issues, but they still lag behind specialist models in segmentation quality.
The authors hypothesize that the sparse representation of visual prompts, treated analogously to linguistic prompts, may limit the ability of generalist models to produce high-quality segmentation. They propose bringing the dense design common in specialist models into generalist models to better preserve the detailed spatial attributes of visual prompts.
The authors introduce SegNext, which uses a three-channel dense map to represent five diverse visual prompts: clicks, boxes, polygons, scribbles, and masks. The dense map of visual prompts is encoded into the image embedding space, followed by element-wise addition and a lightweight fusion module.
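The rasterize-then-fuse pipeline described above can be sketched as follows. This is a minimal NumPy illustration, not the paper's implementation: the channel assignment (channel 0 for positive prompts, channel 1 for negative prompts, channel 2 for a previous mask) and the 1x1 projection standing in for the learned prompt encoder are assumptions.

```python
import numpy as np

def dense_prompt_map(h, w, pos_clicks=(), neg_clicks=(), box=None, prev_mask=None):
    """Rasterize clicks, a box, and a previous mask into a 3-channel dense map.

    Channel layout is illustrative: 0 = positive, 1 = negative, 2 = prior mask.
    Polygons and scribbles would be rasterized into the same channels.
    """
    m = np.zeros((3, h, w), dtype=np.float32)
    for (y, x) in pos_clicks:           # positive clicks -> channel 0
        m[0, y, x] = 1.0
    for (y, x) in neg_clicks:           # negative clicks -> channel 1
        m[1, y, x] = 1.0
    if box is not None:                 # box interior -> channel 0
        y0, x0, y1, x1 = box
        m[0, y0:y1, x0:x1] = 1.0
    if prev_mask is not None:           # previous mask -> channel 2
        m[2] = prev_mask.astype(np.float32)
    return m

def fuse(image_embedding, prompt_map, proj):
    """Encode the dense map into the embedding space and add element-wise.

    `proj` is a channel-mixing matrix standing in for a learned prompt encoder;
    a lightweight fusion module would follow the addition in the full model.
    """
    prompt_embedding = np.einsum('oc,chw->ohw', proj, prompt_map)
    return image_embedding + prompt_embedding

# Usage: a 4-channel embedding on an 8x8 grid, one positive click plus a box.
emb = np.zeros((4, 8, 8), dtype=np.float32)
pm = dense_prompt_map(8, 8, pos_clicks=[(2, 3)], box=(1, 1, 4, 4))
fused = fuse(emb, pm, proj=np.ones((4, 3), dtype=np.float32))
```

Because the prompts live in a dense spatial map rather than a handful of sparse tokens, their exact locations and extents survive into the embedding space, which is the property the dense design is meant to preserve.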
Extensive evaluations on HQSeg-44K and DAVIS show that SegNext outperforms prior state-of-the-art methods, both quantitatively and qualitatively. The authors also conduct out-of-domain evaluations on medical image datasets, demonstrating the generalizability of their approach.
Stats
The authors report the following key metrics:
SAT Latency: Measures the latency for the Segment Anything Task (SAT), which is crucial for real-time applications.
mIoU: Measures the average intersection over union (IoU) given a fixed number of consecutive interactions.
NoC: Measures the number of clicks required to achieve a predefined IoU.
NoF: Measures the number of failure cases, where more than 20 clicks are required to achieve a predefined IoU.
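The click-based metrics above can be computed from per-click IoU trajectories, one trajectory per test sample. The sketch below follows the definitions listed here (target IoU, 20-click budget); the function names and the evaluation protocol details are illustrative, not taken from the paper's evaluation code.

```python
def noc(ious, target=0.9, max_clicks=20):
    """NoC: number of clicks to first reach the target IoU (max_clicks if never)."""
    for k, iou in enumerate(ious[:max_clicks], start=1):
        if iou >= target:
            return k
    return max_clicks

def nof(trajectories, target=0.9, max_clicks=20):
    """NoF: count of samples that never reach the target IoU within max_clicks."""
    return sum(1 for ious in trajectories
               if all(iou < target for iou in ious[:max_clicks]))

def miou_at(trajectories, k):
    """mIoU: mean IoU across samples after k consecutive interactions."""
    return sum(ious[k - 1] for ious in trajectories) / len(trajectories)

# Usage with two toy trajectories of per-click IoU values.
trajs = [[0.5, 0.8, 0.92], [0.6, 0.7, 0.75]]
print(noc(trajs[0]))      # 3: target 0.9 is first reached on the third click
print(nof(trajs))         # 1: the second sample never reaches 0.9
print(miou_at(trajs, 2))  # 0.75: mean of 0.8 and 0.7
```

NoC and NoF reward models that converge quickly under interaction, while mIoU at a fixed interaction budget captures final segmentation quality.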