Improving Spatial Reasoning in Visual Large Language Models through Localization-Focused Training
Explicitly training visual-LLMs to process and generate textual image-space coordinates improves their spatial reasoning abilities, leading to better performance on vision-language tasks.
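
To make "textual image-space coordinates" concrete, the following minimal sketch shows one way a bounding box could be written as plain text for a model to read or generate, and parsed back again. The format used here (corner coordinates normalized to [0, 1], two decimals, square brackets), the prompt wording, and the helper names `box_to_text`, `text_to_box`, and `make_localization_example` are all illustrative assumptions, not details taken from the work summarized above.

```python
import re
from typing import Dict, Tuple

# (x_min, y_min, x_max, y_max) in pixel units; a convention assumed for this sketch.
Box = Tuple[float, float, float, float]


def box_to_text(box: Box, width: int, height: int, decimals: int = 2) -> str:
    """Serialize a pixel-space box as normalized coordinates in plain text."""
    x0, y0, x1, y1 = box
    normalized = (x0 / width, y0 / height, x1 / width, y1 / height)
    return "[" + ", ".join(f"{v:.{decimals}f}" for v in normalized) + "]"


def text_to_box(text: str, width: int, height: int) -> Box:
    """Parse the first four numbers in generated text back into a pixel-space box."""
    numbers = [float(m) for m in re.findall(r"\d+(?:\.\d+)?", text)]
    if len(numbers) < 4:
        raise ValueError(f"expected 4 coordinates, found {len(numbers)}")
    x0, y0, x1, y1 = numbers[:4]
    return (x0 * width, y0 * height, x1 * width, y1 * height)


def make_localization_example(name: str, box: Box, width: int, height: int) -> Dict[str, str]:
    """Build a hypothetical prompt/response pair for a localization-style training sample."""
    return {
        "prompt": f"Where is the {name} in the image?",
        "response": f"The {name} is at {box_to_text(box, width, height)}.",
    }


example = make_localization_example("dog", (64, 120, 320, 360), width=640, height=480)
print(example["response"])
```

Normalizing by image size keeps the text independent of resolution, and rounding to a fixed number of decimals keeps the coordinate strings short; both are design choices of this sketch rather than claims about the original method.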