GROUNDHOG: A Multimodal Language Model for Pixel-Level Grounding of Text to Visual Entities
GROUNDHOG is a multimodal large language model that grounds text to pixel-level segmentation masks of visual entities, enabling fine-grained vision-language alignment and interpretable grounding.
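To make "grounding text to pixel-level segmentation masks" concrete, the following is a minimal sketch of what a grounded output could look like as data: a text span in a caption paired with a binary mask over the image. Everything here is an assumption made for illustration and is not taken from GROUNDHOG itself: the names `GroundedPhrase`, `ground`, `mask_area`, and `mask_iou`, the character-offset span convention, and the toy 4x4 masks are all hypothetical.

```python
from dataclasses import dataclass
from typing import List, Tuple

# Hypothetical representation: a binary mask as rows of 0/1 values,
# where 1 marks a pixel that belongs to the grounded entity.
Mask = List[List[int]]


@dataclass
class GroundedPhrase:
    """A text span paired with a pixel-level mask (illustrative only)."""
    text: str
    span: Tuple[int, int]  # character offsets [start, end) into the caption
    mask: Mask


def ground(caption: str, phrase: str, mask: Mask) -> GroundedPhrase:
    """Attach a mask to the first occurrence of `phrase` in `caption`."""
    start = caption.find(phrase)
    if start < 0:
        raise ValueError(f"phrase {phrase!r} not found in caption")
    return GroundedPhrase(phrase, (start, start + len(phrase)), mask)


def mask_area(mask: Mask) -> int:
    """Number of pixels covered by the mask."""
    return sum(sum(row) for row in mask)


def mask_iou(a: Mask, b: Mask) -> float:
    """Intersection over union of two same-sized binary masks."""
    pairs = [(x, y) for row_a, row_b in zip(a, b) for x, y in zip(row_a, row_b)]
    inter = sum(x & y for x, y in pairs)
    union = sum(x | y for x, y in pairs)
    return inter / union if union else 0.0


# Toy example: two entities in a 4x4 image (made-up masks).
caption = "a dog next to a ball"
dog_mask = [[1, 1, 0, 0],
            [1, 1, 0, 0],
            [0, 0, 0, 0],
            [0, 0, 0, 0]]
ball_mask = [[0, 0, 0, 0],
             [0, 1, 1, 0],
             [0, 0, 0, 0],
             [0, 0, 0, 0]]

dog = ground(caption, "dog", dog_mask)
ball = ground(caption, "ball", ball_mask)
```

Pairing each phrase with its own mask, rather than a bounding box, is what makes the grounding inspectable at pixel level: a reader can check exactly which pixels a phrase refers to, and overlap between entities can be measured with a metric such as `mask_iou`.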