Key Idea
Integrating visual and textual prompts significantly improves multimodal large language models' ability to accurately perceive and reason about objects in visual question answering tasks.
Abstract
The paper presents a novel approach called VTPrompt that enhances the object-oriented perception capabilities of multimodal large language models (MLLMs) like GPT-4V and Gemini Pro in visual question answering (VQA) tasks.
Key highlights:
- MLLMs often struggle with fine-grained understanding of object identities, locations, and attributes in VQA tasks, as indicated by empirical evaluations on benchmarks like MMB and MME.
- VTPrompt addresses this by jointly leveraging visual and textual prompts to guide MLLMs' perception and reasoning.
- The approach involves three main steps (a sketch of the full pipeline follows this list):
- Key Concept Extraction: Extracting key objects from the textual question using GPT-4.
- VPrompt Generation: Using the extracted key concepts to guide a detection model (e.g., SPHINX) to annotate the image with visual markers (bounding boxes).
- TPrompt for Answer Generation: Designing a structured text prompt to effectively leverage the annotated image alongside the original question for the MLLM to generate the final answer.
- Experiments on MMB, MME, and POPE benchmarks demonstrate significant performance improvements for both GPT-4V and Gemini Pro, setting new state-of-the-art records on MMB.
- Analysis reveals VTPrompt's ability to enhance object-oriented perception, reduce object hallucination, and improve overall multimodal reasoning capabilities of MLLMs.
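To make the pipeline concrete, below is a minimal Python sketch of the three steps. It is an illustration under assumptions: the model IDs, the prompt wording, and the detect() helper (standing in for a detector such as SPHINX) are placeholders, not the authors' actual prompts or code.

```python
# Hypothetical sketch of the VTPrompt pipeline: key concept extraction,
# visual prompt (bounding box) annotation, and structured text prompting.
import base64
import io

from openai import OpenAI
from PIL import Image, ImageDraw

client = OpenAI()  # reads OPENAI_API_KEY from the environment


def extract_key_concepts(question: str) -> list[str]:
    """Step 1: ask GPT-4 for the key objects mentioned in the question."""
    resp = client.chat.completions.create(
        model="gpt-4",
        messages=[{
            "role": "user",
            "content": "List the key objects mentioned in this question, "
                       f"comma-separated, nothing else:\n{question}",
        }],
    )
    return [c.strip() for c in resp.choices[0].message.content.split(",")]


def detect(image: Image.Image, concepts: list[str]) -> list[tuple[str, tuple]]:
    """Placeholder for an open-vocabulary detection model such as SPHINX.
    Should return (label, (x0, y0, x1, y1)) pairs for each key concept."""
    raise NotImplementedError("plug in your detection model here")


def add_visual_prompt(image: Image.Image, boxes) -> Image.Image:
    """Step 2: overlay bounding boxes as visual markers on the image."""
    annotated = image.copy()
    draw = ImageDraw.Draw(annotated)
    for label, (x0, y0, x1, y1) in boxes:
        draw.rectangle((x0, y0, x1, y1), outline="red", width=3)
        draw.text((x0, max(0, y0 - 12)), label, fill="red")
    return annotated


def answer(image: Image.Image, question: str) -> str:
    """Step 3: feed the annotated image plus a structured text prompt to the MLLM."""
    concepts = extract_key_concepts(question)
    annotated = add_visual_prompt(image, detect(image, concepts))

    # Encode the annotated image for the vision-capable chat endpoint.
    buf = io.BytesIO()
    annotated.convert("RGB").save(buf, format="JPEG")
    image_b64 = base64.b64encode(buf.getvalue()).decode()

    # Structured text prompt (TPrompt); the exact wording is an assumption.
    tprompt = (
        "The image is annotated with boxes marking these key objects: "
        f"{', '.join(concepts)}. Use the marked regions to answer precisely.\n"
        f"Question: {question}"
    )
    resp = client.chat.completions.create(
        model="gpt-4-vision-preview",
        messages=[{
            "role": "user",
            "content": [
                {"type": "text", "text": tprompt},
                {"type": "image_url",
                 "image_url": {"url": f"data:image/jpeg;base64,{image_b64}"}},
            ],
        }],
    )
    return resp.choices[0].message.content
```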
Statistics
GPT-4V's score on MME improved by 183.5 points with VTPrompt.
Gemini Pro's performance on MMB increased by 15.69% with VTPrompt.
VTPrompt boosted GPT-4V's object localization accuracy by 10.15%, spatial relationships by 10.74%, and attribute comparison by 19.15% on the MMB benchmark.
For Gemini Pro, VTPrompt led to an 18.09% improvement in object localization, 35.03% in spatial relationships, and 16.31% in attribute comparison on MMB.
Quotes
"Multimodal Large Language Models (MLLMs) such as GPT-4V and Gemini Pro face challenges in achieving human-level perception in Visual Question Answering (VQA), particularly in object-oriented perception tasks which demand fine-grained understanding of object identities, locations or attributes, as indicated by empirical findings."
"VTPrompt merges visual and text prompts to extract key concepts from textual questions and employs a detection model to highlight relevant objects as visual prompts in images. The processed images alongside text prompts are subsequently fed into MLLMs to produce more accurate answers."