This paper provides a comprehensive overview of visual prompting methods in multimodal large language models (MLLMs). It categorizes different types of visual prompts, including bounding boxes, markers, pixel-level prompts, and soft visual prompts. The paper then discusses various visual prompt generation techniques, such as prompt engineering, visual segmentation, object detection, and learnable/soft visual prompts.
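As a concrete illustration of the simplest category above, a bounding-box visual prompt is typically just a colored rectangle drawn onto the image before it is passed to the MLLM. The sketch below is a minimal, hypothetical example (not code from the paper): it represents an image as a nested list of RGB tuples and overlays a red box border on a region of interest.

```python
def draw_box_prompt(image, box, color=(255, 0, 0), thickness=2):
    """Overlay a rectangular border (a bounding-box visual prompt)
    on an image stored as a nested list of RGB tuples.

    box is (x0, y0, x1, y1) in pixel coordinates, inclusive.
    """
    x0, y0, x1, y1 = box
    for y in range(y0, y1 + 1):
        for x in range(x0, x1 + 1):
            on_border = (
                y - y0 < thickness or y1 - y < thickness
                or x - x0 < thickness or x1 - x < thickness
            )
            if on_border:
                image[y][x] = color
    return image

# Hypothetical usage: mark a region of interest, then send the
# annotated image to an MLLM alongside a referring question
# such as "What is the object inside the red box?".
img = [[(0, 0, 0) for _ in range(32)] for _ in range(32)]
img = draw_box_prompt(img, (4, 4, 20, 20))
```

In practice a library such as Pillow or OpenCV would do the drawing, and marker-style prompts (e.g. numbered labels, as in set-of-mark prompting) follow the same pattern of editing pixels rather than text.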
The survey also examines how visual prompting enhances MLLMs' perception and reasoning capabilities. It covers improvements in visual grounding, referring, multi-image and video understanding, as well as 3D visual understanding. The paper highlights how visual prompting enables more controllable compositional reasoning in tasks such as visual planning, reasoning, and action generation.
Finally, the survey summarizes model training and in-context learning methods that align MLLMs with visual prompts, addressing issues like hallucination and language bias. Overall, this paper provides a comprehensive review of the state-of-the-art in visual prompting for MLLMs and outlines future research directions in this emerging field.
Key insights drawn from arxiv.org
by Junda Wu, Zh..., 09-25-2024
https://arxiv.org/pdf/2409.15310.pdf