Core Concepts
VLMs can improve their semantic grounding performance by receiving and generating feedback, without requiring in-domain data, fine-tuning, or changes to the network architecture.
Abstract
The paper explores whether Vision-Language Models (VLMs) can improve their semantic grounding abilities by receiving and generating feedback. The key findings are:
VLMs can read feedback to improve their downstream semantic grounding performance. Providing noise-free binary feedback or class label feedback can improve grounding accuracy by up to 12 and 61 points, respectively, in a single step. Across five rounds of noise-free feedback, the improvement can exceed 15 accuracy points.
VLMs can serve as binary feedback providers, but similar to Large Language Models (LLMs), they struggle with intrinsic self-correction. This issue can be mitigated by using external techniques like visual prompting, where the VLM verifier receives a modified version of the input image to conduct binary verification.
An automated iterative feedback mechanism, in which the VLM alternates between receiving and generating feedback, can improve semantic grounding accuracy by up to nearly 5 points. This contrasts with prior work on intrinsic self-correction in LLMs, where self-correction can decrease performance by up to 10 points.
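The alternation between generating predictions and receiving binary feedback described above can be sketched as a simple loop. This is an illustrative toy, not the paper's implementation: `predict_label` and `verify_with_visual_prompt` stand in for the VLM grounder and the visually prompted VLM verifier, and the label space and oracle answer are hypothetical.

```python
# Illustrative sketch of the iterative feedback loop: the model proposes a
# label for a region, a verifier gives binary feedback, and rejected labels
# are excluded from the next prediction. All functions are toy stubs.

def predict_label(image, region, excluded):
    """Stub grounding model: return the first candidate not yet rejected."""
    candidates = ["dog", "cat", "horse"]  # hypothetical label space
    for label in candidates:
        if label not in excluded:
            return label
    return candidates[-1]

def verify_with_visual_prompt(image, region, label):
    """Stub verifier: in the paper, the VLM would see a modified image
    (visual prompting) and answer yes/no; here a toy oracle answers."""
    return label == image["true_label"]

def iterative_grounding(image, region, max_rounds=5):
    """Alternate between generating a prediction and receiving binary
    feedback, re-predicting after each rejection."""
    excluded = set()
    label = predict_label(image, region, excluded)
    for _ in range(max_rounds):
        if verify_with_visual_prompt(image, region, label):
            return label  # accepted by the verifier
        excluded.add(label)  # binary feedback: "incorrect"
        label = predict_label(image, region, excluded)
    return label

toy_image = {"true_label": "cat"}
print(iterative_grounding(toy_image, region=(0, 0, 64, 64)))  # prints "cat"
```

Because the feedback is binary, each round can only rule out one hypothesis, which is why gains accumulate over rounds and why the approach costs extra inference compute per query.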
The paper concludes that feedback-based reasoning is a promising approach to enhancing semantic grounding in VLMs, without requiring expensive retraining, additional data, or architectural changes. However, the approach trades additional inference compute for performance, making it less practical for low-latency applications.
Stats
"VLMs can improve their semantic grounding performance by up to 61 accuracy points with noise-free class label feedback in a single step."
"VLMs can improve their semantic grounding performance by over 15 accuracy points across five rounds of noise-free feedback."
"VLMs can improve their semantic grounding performance by up to nearly 5 accuracy points using automated binary feedback."
Quotes
"VLMs can read the feedback to improve downstream semantic grounding."
"VLMs can be used as binary feedback providers, but similar to LLMs, they struggle to correct themselves out-of-the-box."
"VLMs benefit from automatic iterative feedback by improving semantic grounding accuracy up to nearly 5 accuracy points."