Enhancing Multimodal Large Language Models' Visual Reasoning Capabilities through Plug-and-Play Grounding
The authors propose P2G, a framework that improves the grounding and factuality of reasoning in multimodal large language models by leveraging external agents to supply detailed textual and visual clues on demand, without relying on extensive supervised instruction-following data.
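To make the plug-and-play idea concrete, below is a minimal Python sketch of how a multimodal LLM might defer to external agents for textual and visual clues before answering. The function names (`mllm_generate`, `ocr_agent`, `grounding_agent`) and the "UNSURE" uncertainty signal are illustrative assumptions, not interfaces from the paper.

```python
# Minimal sketch of a plug-and-play grounding loop. All interfaces here are
# hypothetical stand-ins for the external agents described in the summary.
from typing import Callable, List


def answer_with_grounding(
    image: bytes,
    question: str,
    mllm_generate: Callable[[bytes, str], str],
    ocr_agent: Callable[[bytes], List[str]],
    grounding_agent: Callable[[bytes, str], List[str]],
) -> str:
    # First pass: let the MLLM attempt a direct answer.
    draft = mllm_generate(image, question)

    # If the model signals uncertainty (a stand-in for a deliberation step),
    # gather clues from the external agents and ask again with the clues
    # appended to the prompt.
    if "UNSURE" in draft:
        text_clues = ocr_agent(image)                    # textual clues, e.g. OCR'd words
        visual_clues = grounding_agent(image, question)  # visual clues, e.g. region descriptions

        enriched_prompt = (
            f"{question}\n"
            f"Textual clues: {'; '.join(text_clues)}\n"
            f"Visual clues: {'; '.join(visual_clues)}"
        )
        return mllm_generate(image, enriched_prompt)

    return draft
```

The key design point this sketch illustrates is that the agents are invoked only when needed and communicate with the model purely through added prompt text, so no retraining of the base model on large instruction-following corpora is required.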