Integrating visual and textual prompts significantly improves multimodal large language models' ability to accurately perceive and reason about objects in visual question answering tasks.
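To make the idea concrete, below is a minimal sketch of how a visual prompt (a drawn bounding box marking the object of interest) can be combined with a textual prompt that references the mark before both are sent to a multimodal model. The image contents, box coordinates, and the commented-out `query_mllm` call are illustrative placeholders, not part of the original text or any specific model's API.

```python
# Sketch: compose a visual prompt (red box on the image) with a textual prompt
# that points the model at the marked region. All inputs are hypothetical.
from PIL import Image, ImageDraw


def add_visual_prompt(image: Image.Image, box: tuple[int, int, int, int]) -> Image.Image:
    """Overlay a red bounding box to mark the object the question is about."""
    marked = image.copy()
    draw = ImageDraw.Draw(marked)
    draw.rectangle(box, outline="red", width=4)
    return marked


def build_textual_prompt(question: str) -> str:
    """Reference the visual mark explicitly so the model grounds its answer in that region."""
    return f"Look at the object inside the red box. {question}"


if __name__ == "__main__":
    image = Image.new("RGB", (640, 480), "white")          # stand-in for a real photo
    prompted_image = add_visual_prompt(image, (40, 60, 220, 300))
    prompt = build_textual_prompt("What is this object, and what is it used for?")
    # answer = query_mllm(prompted_image, prompt)           # placeholder for any MLLM inference call
    print(prompt)
```

The point of the sketch is the pairing: the visual mark localizes the object, and the textual prompt ties the question to that mark, so the model does not have to guess which object the question refers to.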
Pairing tactile data with corresponding visual and textual descriptions in touch-language-vision datasets enhances multimodal alignment.
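One common way to use such paired data for alignment is a CLIP-style contrastive objective that pulls matched touch-text and touch-vision embeddings together. The sketch below assumes the three encoders already produce fixed-size embeddings; the random tensors, dimensions, and the specific combination of pairwise losses are illustrative assumptions, not a method described in the original text.

```python
# Sketch: tri-modal contrastive alignment over paired touch/language/vision embeddings.
# Encoder outputs are random stand-ins; batch size and embedding dimension are illustrative.
import torch
import torch.nn.functional as F


def contrastive_loss(a: torch.Tensor, b: torch.Tensor, temperature: float = 0.07) -> torch.Tensor:
    """Symmetric InfoNCE between two batches of L2-normalized embeddings."""
    a = F.normalize(a, dim=-1)
    b = F.normalize(b, dim=-1)
    logits = a @ b.t() / temperature
    targets = torch.arange(a.size(0))
    return (F.cross_entropy(logits, targets) + F.cross_entropy(logits.t(), targets)) / 2


batch, dim = 8, 512
touch = torch.randn(batch, dim)    # tactile encoder output (placeholder)
text = torch.randn(batch, dim)     # language encoder output (placeholder)
vision = torch.randn(batch, dim)   # vision encoder output (placeholder)

# One plausible objective: align touch with both language and vision.
loss = contrastive_loss(touch, text) + contrastive_loss(touch, vision)
print(loss.item())
```

In this framing, each matched triplet in the dataset supplies the positive pairs, and the other items in the batch serve as negatives, which is what drives the three modalities toward a shared embedding space.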