Enhancing Multimodal Large Language Models' Object-Centric Perception through Joint Visual and Text Prompting
Integrating visual and textual prompts significantly improves multimodal large language models' ability to accurately perceive and reason about objects in visual question answering tasks.