Systematic Shortcomings in Visual Grounding of Multimodal Large Language Models
Multimodal Large Language Models (MLLMs) exhibit systematic shortcomings in visual grounding that stem from limitations of the CLIP vision encoders they are built on.