The paper introduces GROUNDHOG, a novel multimodal large language model (MLLM) that grounds text to pixel-level segmentation masks of visual entities. Unlike previous MLLM approaches that rely on bounding boxes, GROUNDHOG utilizes a masked feature extractor to convert class-agnostic entity masks into visual tokens, which are then connected to groundable phrases by the MLLM backbone.
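The masked feature extractor can be pictured as masked average pooling: each proposed entity mask selects a region of the backbone feature map and is collapsed into a single "mask token" the language model can attend to. The sketch below is a minimal illustration of that idea under an average-pooling assumption; the function name `mask_pooled_tokens` and the tensor shapes are hypothetical, not the paper's actual implementation.

```python
import torch

def mask_pooled_tokens(feature_map: torch.Tensor, masks: torch.Tensor) -> torch.Tensor:
    """Convert class-agnostic entity masks into one visual token each
    via masked average pooling over a backbone feature map.

    feature_map: (C, H, W) features from a visual backbone.
    masks:       (N, H, W) binary masks, one per proposed entity.
    returns:     (N, C) mask tokens, one embedding per entity.
    """
    masks = masks.float()
    # Sum the features inside each mask, then normalize by mask area.
    pooled = torch.einsum('nhw,chw->nc', masks, feature_map)
    area = masks.sum(dim=(1, 2)).clamp(min=1.0).unsqueeze(-1)
    return pooled / area

# Example: 10 proposed entity masks over a 64x64 feature grid, 256 channels.
feats = torch.randn(256, 64, 64)
proposals = torch.rand(10, 64, 64) > 0.5
tokens = mask_pooled_tokens(feats, proposals)  # shape: (10, 256)
```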
The key highlights are:
Pixel-level grounding: GROUNDHOG enables unprecedented pixel-level vision-language alignment, going beyond the limitations of bounding box-based grounding.
Holistic segmentation: GROUNDHOG leverages a multi-grained segmentation model to propose entity masks covering a diverse range of visual semantics, including instances, stuff, parts, and text.
Interpretable grounding: The decoupled design of mask proposal and language grounding provides transparency: when grounding fails, it is easy to tell whether the segmentation model missed the entity or the language model selected the wrong mask (see the sketch after this list).
Comprehensive dataset: The authors curated M3G2, a grounded visual instruction tuning dataset of 2.5M text-image pairs with diverse grounding annotations spanning four task types, to train GROUNDHOG.
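To make the decoupled design concrete, here is a minimal sketch of the grounding stage as phrase-to-mask scoring over the tokens produced above. The cosine-similarity scoring, the `ground_phrases` name, and the threshold are illustrative assumptions; in GROUNDHOG the MLLM backbone itself connects groundable phrases to mask tokens rather than using a bare dot product.

```python
import torch
import torch.nn.functional as F

def ground_phrases(phrase_embs: torch.Tensor,
                   mask_tokens: torch.Tensor,
                   threshold: float = 0.5) -> torch.Tensor:
    """Decoupled grounding stage: score every phrase-mask pair and
    select the masks above a threshold for each phrase.

    phrase_embs: (P, C) one embedding per groundable phrase.
    mask_tokens: (N, C) one token per proposed mask, at any granularity
                 (instance, stuff, part, or text region).
    returns:     (P, N) boolean selection matrix.
    """
    phrase_embs = F.normalize(phrase_embs, dim=-1)
    mask_tokens = F.normalize(mask_tokens, dim=-1)
    scores = phrase_embs @ mask_tokens.T  # cosine similarity per pair
    return scores > threshold

# Example: ground 3 phrases against 10 proposed masks.
phrases = torch.randn(3, 256)
masks = torch.randn(10, 256)
selection = ground_phrases(phrases, masks)  # shape: (3, 10)
```

Because the two stages are separate, a failed grounding is easy to diagnose: inspecting the proposals and the score matrix shows whether the mask proposer never surfaced the entity or the grounding stage picked the wrong mask.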
Experiments show that GROUNDHOG achieves superior performance on various grounding tasks without task-specific fine-tuning, and significantly reduces object hallucination compared to previous MLLM approaches.
Key insights distilled from the source content by Yichi Zhang et al., arxiv.org, 04-17-2024
https://arxiv.org/pdf/2402.16846.pdf