Core Concepts
Explicitly training visual-LLMs to process and generate textual image-space coordinates improves their spatial reasoning abilities, leading to better performance on vision-language tasks.
Abstract
The paper shows that existing visual-LLMs (V-LLMs) such as BLIP-2 and LLaVA exhibit weak spatial reasoning and localization awareness despite strong performance on visual question answering (VQA) tasks. To address this limitation, the authors propose a framework called LocVLM that injects textual spatial coordinate awareness into V-LLMs through novel instruction fine-tuning objectives.
The key highlights of the work are:
- Exploration of three textual representations for image-space coordinates (normalized floating-point values, integer-valued binning, and deviation from image-grid-based anchors), with integer-valued binning performing best; the three encodings are illustrated in the first sketch after this list.
- Proposal of three instruction fine-tuning objectives: Location Prediction (LocPred), Negative Prediction (NegPred), and Reverse-Location Prediction (RevLoc), which directly train the V-LLM to process and generate textual coordinates (see the template sketch after this list).
- Introduction of a pseudo-data generation strategy that uses the V-LLM itself to create object-level captions and bounding boxes, expanding the training data beyond the COCO dataset (sketched after this list).
- Extension of the framework to the video domain, with additional video-specific instruction fine-tuning objectives.
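The three coordinate encodings can be made concrete with a small sketch. The bin count, float precision, grid size, and output formats below are assumptions chosen for illustration, not the authors' released implementation.

```python
def normalized_float(box, width, height, precision=2):
    """Normalized floating-point representation: coordinates scaled to [0, 1]."""
    x1, y1, x2, y2 = box
    vals = (x1 / width, y1 / height, x2 / width, y2 / height)
    return "[" + ", ".join(f"{v:.{precision}f}" for v in vals) + "]"

def integer_binning(box, width, height, num_bins=100):
    """Integer-valued binning: quantize each coordinate into one of num_bins bins."""
    x1, y1, x2, y2 = box
    bins = [int(v / s * (num_bins - 1)) for v, s in
            zip((x1, y1, x2, y2), (width, height, width, height))]
    return "[" + ", ".join(str(b) for b in bins) + "]"

def anchor_deviation(box, width, height, grid=3):
    """Deviation from image-grid anchors: box center as an offset from the nearest anchor."""
    x1, y1, x2, y2 = box
    cx, cy = (x1 + x2) / 2 / width, (y1 + y2) / 2 / height
    # Anchors sit at the centers of a grid x grid partition of the image.
    ax = (round(cx * grid - 0.5) + 0.5) / grid
    ay = (round(cy * grid - 0.5) + 0.5) / grid
    return f"anchor ({ax:.2f}, {ay:.2f}) + offset ({cx - ax:+.2f}, {cy - ay:+.2f})"

box = (120, 40, 360, 300)                     # pixel-space (x1, y1, x2, y2)
print(normalized_float(box, 640, 480))        # -> [0.19, 0.08, 0.56, 0.62]
print(integer_binning(box, 640, 480))         # -> [18, 8, 55, 61]
print(anchor_deviation(box, 640, 480))        # -> anchor (0.50, 0.50) + offset (-0.12, -0.15)
```

Because every representation is plain text, the model can emit and consume coordinates with its ordinary language head; integer binning keeps the vocabulary small and token-friendly, which is consistent with it performing best.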
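A minimal sketch of how the three objectives could be posed as instruction/response pairs, assuming simple question templates; the exact prompt wording used in the paper may differ.

```python
def loc_pred(obj, box_text):
    """Location Prediction: ask where an object is, answer with textual coordinates."""
    return (f"Where is the {obj} in the image?",
            f"The {obj} is located at {box_text} bbox.")

def neg_pred(absent_obj):
    """Negative Prediction: ask about an absent object, answer that it is not present."""
    return (f"Where is the {absent_obj} in the image?",
            f"There's no {absent_obj} in the image.")

def rev_loc(box_text, obj):
    """Reverse-Location Prediction: give textual coordinates, answer with the object found there."""
    return (f"What is located at {box_text} in the image?",
            f"A {obj} is located at {box_text}.")

# Example instruction/response pairs built from a single annotated object.
print(loc_pred("dog", "[18, 8, 55, 61]"))
print(neg_pred("umbrella"))
print(rev_loc("[18, 8, 55, 61]", "dog"))
```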
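The pseudo-data generation idea can also be sketched at a high level: the model itself proposes object regions and describes them, yielding (caption, box) pairs without human labels. The calls `propose_boxes` and `caption` below are hypothetical stand-ins, not the paper's API.

```python
def generate_pseudo_annotations(image, vllm):
    """Return (caption, box) pairs produced by the V-LLM itself, without human labels."""
    # Step 1: ask the coordinate-aware V-LLM to enumerate objects with bounding boxes.
    boxes = vllm.propose_boxes(image)       # hypothetical call: [(x1, y1, x2, y2), ...]
    pseudo_pairs = []
    for box in boxes:
        region = image.crop(box)            # PIL-style crop of the proposed region
        caption = vllm.caption(region)      # hypothetical call: describe just this region
        pseudo_pairs.append((caption, box))
    return pseudo_pairs
```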
Experiments across 5 vision-language tasks involving 14 different datasets demonstrate clear performance improvements of LocVLM over existing V-LLMs, particularly in spatial reasoning, VQA, object hallucination reduction, and region description.
Examples
"The dog is located at (x1, y1, x2, y2) bbox."
"There's no obj2 in the image."
"The blue plaid blanket is behind the teddy bear."
Quotes
"We discover optimal coordinate representations, data-efficient instruction fine-tuning objectives, and pseudo-data generation strategies that lead to improved spatial awareness in V-LLMs."
"Additionally, our resulting model improves VQA across image and video domains, reduces undesired hallucination, and generates better contextual object descriptions."