Enhancing Multimodal Language Models' Pixel-Level Understanding through Visual Prompting
SPHINX-V is a new multimodal large language model that leverages visual prompts (e.g., points, boxes, and free-form shapes) to enable fine-grained, pixel-level understanding of images across diverse domains. It outperforms existing methods on tasks such as referring object classification, region-level captioning, and complex region-level reasoning.
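
The core idea behind this style of visual prompting is that a user-drawn region (a point or a box) is encoded into the same hidden space as the image and text tokens, so the language model can attend to the indicated region directly. The sketch below illustrates that idea in PyTorch; the module name `VisualPromptEncoder`, the helper `build_llm_inputs`, the hidden size, and the fusion-by-concatenation strategy are all illustrative assumptions, not SPHINX-V's actual implementation.

```python
# Hypothetical sketch: embed normalized prompt coordinates and append them to
# the image patch features that condition the language model. Shapes and names
# are assumptions for illustration only.
import torch
import torch.nn as nn

class VisualPromptEncoder(nn.Module):
    """Projects normalized prompt coordinates into the model's hidden space."""
    def __init__(self, hidden_dim: int = 4096):
        super().__init__()
        # A box is (x1, y1, x2, y2); a point (x, y) can be padded to 4 dims.
        self.proj = nn.Sequential(
            nn.Linear(4, hidden_dim),
            nn.GELU(),
            nn.Linear(hidden_dim, hidden_dim),
        )

    def forward(self, prompts: torch.Tensor) -> torch.Tensor:
        # prompts: (num_prompts, 4), coordinates normalized to [0, 1]
        return self.proj(prompts)  # (num_prompts, hidden_dim)

def build_llm_inputs(image_feats: torch.Tensor,
                     prompt_feats: torch.Tensor,
                     text_embeds: torch.Tensor) -> torch.Tensor:
    """Concatenate image tokens, visual-prompt tokens, and text tokens into
    one sequence for the language model (one common fusion strategy)."""
    return torch.cat([image_feats, prompt_feats, text_embeds], dim=0)

# Usage: one box prompt alongside a 256-token image representation.
encoder = VisualPromptEncoder(hidden_dim=4096)
box = torch.tensor([[0.1, 0.2, 0.5, 0.6]])   # normalized (x1, y1, x2, y2)
image_feats = torch.randn(256, 4096)          # from a vision encoder
text_embeds = torch.randn(12, 4096)           # embedded instruction tokens
seq = build_llm_inputs(image_feats, encoder(box), text_embeds)
print(seq.shape)  # torch.Size([269, 4096])
```

Fusing prompt tokens into the input sequence (rather than, say, painting markers onto the pixels) lets the model ground its answer in an exact region while keeping the image features untouched, which is one plausible reason such approaches help on region-level tasks.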