Core Concepts
Empowering MLLMs with segmentation ability while preserving dialogue and reasoning skills.
Abstract
This work extends Multi-modal Large Language Models (MLLMs) with segmentation ability. It introduces LLaVASeg, a novel framework that leverages chain-of-thought prompting to instruct MLLMs to segment the target regions queried by users. The method preserves the original dialogue ability of MLLMs while enhancing their segmentation performance. Experiments demonstrate the effectiveness of LLaVASeg, which achieves superior segmentation and dialogue results compared to previous methods such as LISA.
Directory:
Abstract
Extension of MLLMs with segmentation ability.
Introduction of LLaVASeg framework using chain-of-thought prompting.
Introduction
Scaling of Large Language Models (LLMs) in data and model size.
Emergence of chatbots based on LLMs.
Related Work
Overview of multi-modal large language models.
Previous works on enabling MLLMs for vision tasks.
Method
Detailed explanation of the proposed LLaVASeg framework.
Experiment
Evaluation metrics for reasoning segmentation performance.
Ablation Study
Impact analysis of different prompts and multi-scale features on performance.
Conclusion
Limitations, future works, and conclusion.
Stats
Although they achieve superior segmentation performance, we observe that the dialogue ability decreases by a large margin compared to the original MLLMs.
Experiments show that the proposed method keeps the original dialogue ability and equips the MLLM with strong reasoning segmentation ability.
Quotes
"Our method achieves competitive segmentation ability by using off-the-shelf MLLMs."
"We propose a novel chain-of-thought prompting strategy that iteratively prompts MLLMs to generate image-specific textual attributes to prompt the segmentation model."
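The quoted strategy can be illustrated with a minimal sketch: the MLLM is queried in stages (reason about the question, name the target, describe its visual attributes), and the distilled text then prompts a separate segmentation model, leaving the MLLM itself untouched. All function names, prompt wording, and the staged breakdown below are illustrative assumptions, not the paper's actual API.

```python
# Hypothetical sketch of chain-of-thought prompting for segmentation:
# an off-the-shelf MLLM is prompted iteratively, and its textual outputs
# condition a text-promptable segmenter. Names are illustrative only.

def chain_of_thought_segment(query_mllm, segment, image, question):
    """Iteratively prompt an MLLM, then pass its outputs to a segmenter.

    query_mllm(image, prompt) -> str    : any off-the-shelf MLLM interface.
    segment(image, text_prompts) -> mask: any text-promptable segmenter.
    """
    # Stage 1: reason about the user's question in the context of the image.
    reasoning = query_mllm(image, f"Answer with reasoning: {question}")

    # Stage 2: name the concrete target region implied by the reasoning.
    target = query_mllm(image, f"Given the answer '{reasoning}', "
                               "which object should be segmented?")

    # Stage 3: elicit image-specific visual attributes (color, shape,
    # location) that help the segmenter localize the target.
    attributes = query_mllm(image, "Describe the visual attributes of "
                                   f"the {target} in this image.")

    # The segmenter is prompted with distilled text only, so the MLLM is
    # never fine-tuned and keeps its original dialogue ability.
    return segment(image, [target, attributes])


# Toy stand-ins to show the control flow end to end.
def demo_mllm(image, prompt):
    if prompt.startswith("Answer"):
        return "the ripest fruit is the banana"
    if "segmented" in prompt:
        return "banana"
    return "yellow, curved, left side of the image"

def demo_segmenter(image, text_prompts):
    return {"mask_for": text_prompts[0], "hints": text_prompts[1]}

mask = chain_of_thought_segment(demo_mllm, demo_segmenter,
                                image="img.jpg",
                                question="Which fruit should I eat first?")
print(mask["mask_for"])  # banana
```

The key design point this sketch captures is that segmentation knowledge lives in the prompts rather than in new MLLM weights, which is why dialogue ability is preserved.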