The paper introduces GROUNDHOG, a novel multimodal large language model (MLLM) that grounds text to pixel-level segmentation masks of visual entities. Unlike previous MLLM approaches that rely on bounding boxes, GROUNDHOG utilizes a masked feature extractor to convert class-agnostic entity masks into visual tokens, which are then connected to groundable phrases by the MLLM backbone.
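A minimal sketch of this mask-pooling step, assuming a PyTorch-style setup (all function and variable names here are illustrative, not the paper's actual API):

```python
# Illustrative sketch only: mask-pool backbone features into one visual
# token per class-agnostic entity mask, as GROUNDHOG's masked feature
# extractor is described to do. Names and shapes are assumptions.
import torch
import torch.nn.functional as F

def masked_feature_tokens(feature_map: torch.Tensor,
                          entity_masks: torch.Tensor) -> torch.Tensor:
    """feature_map: (C, H, W) image features; entity_masks: (N, H0, W0)
    binary masks from a mask proposer. Returns (N, C) visual tokens."""
    C, H, W = feature_map.shape
    # Resize the proposed masks to the feature-map resolution.
    masks = F.interpolate(entity_masks[None].float(), size=(H, W),
                          mode="bilinear", align_corners=False)[0]
    # Average the features under each mask (guard against empty masks).
    weights = masks / masks.sum(dim=(1, 2), keepdim=True).clamp(min=1e-6)
    return torch.einsum("nhw,chw->nc", weights, feature_map)

# Example: 4 proposed masks pooled over a 256-channel feature map.
tokens = masked_feature_tokens(torch.randn(256, 32, 32),
                               torch.rand(4, 128, 128) > 0.5)
print(tokens.shape)  # torch.Size([4, 256])
```

These per-mask tokens would then be passed to the MLLM backbone alongside the text tokens, letting groundable phrases attend to individual entities rather than to box regions.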
The key highlights are:
Pixel-level grounding: GROUNDHOG aligns language with pixel-level segmentation masks, going beyond the coarseness of bounding-box-based grounding.
Holistic segmentation: GROUNDHOG leverages a multi-grained segmentation model to propose entity masks covering a diverse range of visual semantics, including instances, stuff, parts, and text.
Interpretable grounding: Decoupling mask proposal from language grounding makes failures transparent and easy to diagnose: an error can be traced to either the mask proposer or the grounding step (see the sketch after this list).
Comprehensive dataset: The authors curated M3G2, a dataset of 2.5M text-image pairs with diverse grounding annotations spanning four task types, to train GROUNDHOG.
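As a rough illustration of why the decoupled design aids diagnosis, consider a hypothetical grounding step that scores a groundable phrase against the per-mask visual tokens (the sigmoid-over-dot-product scoring and all names below are assumptions, not the paper's exact mechanism):

```python
# Illustrative sketch only: score one groundable phrase against the
# per-mask visual tokens produced by the masked feature extractor.
import torch

def ground_phrase(phrase_emb: torch.Tensor,
                  mask_tokens: torch.Tensor,
                  threshold: float = 0.5):
    """phrase_emb: (D,) phrase embedding; mask_tokens: (N, D) one token
    per proposed mask. Returns per-mask scores and selected indices."""
    scores = torch.sigmoid(mask_tokens @ phrase_emb)
    selected = (scores > threshold).nonzero(as_tuple=True)[0]
    return scores, selected

scores, picked = ground_phrase(torch.randn(256), torch.randn(6, 256))
# Diagnosis: if `picked` is empty although a correct mask was proposed,
# the language-grounding side failed; if no proposal covers the entity
# at all, the mask-proposal side failed.
print(scores.tolist(), picked.tolist())
```

Because every candidate mask and its score are explicit, a wrong grounding can be attributed to one of the two stages instead of being hidden inside an end-to-end prediction.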
Experiments show that GROUNDHOG achieves superior performance on various grounding tasks without task-specific fine-tuning, and significantly reduces object hallucination compared to previous MLLM approaches.
Key insights distilled from the paper by Yichi Zhang et al. (arxiv.org, 04-17-2024): https://arxiv.org/pdf/2402.16846.pdf