Explicitly training visual LLMs to process and generate image-space coordinates expressed as text improves their spatial reasoning abilities, leading to better performance on vision-language tasks.
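A minimal sketch of how such coordinate text could be produced and parsed back, assuming a simple quantization scheme; the function names, bin count, and text format are illustrative assumptions, not taken from the paper:

```python
# Sketch: serialize pixel-space boxes as plain text so a visual LLM can read and
# emit coordinates as ordinary tokens. Normalization and formatting are assumed.

def box_to_text(box, image_w, image_h, bins=1000):
    """Quantize a pixel-space box (x1, y1, x2, y2) into normalized integer bins
    and render it as a text span the language model can generate."""
    x1, y1, x2, y2 = box
    coords = [
        round(x1 / image_w * (bins - 1)),
        round(y1 / image_h * (bins - 1)),
        round(x2 / image_w * (bins - 1)),
        round(y2 / image_h * (bins - 1)),
    ]
    return "[" + ", ".join(str(c) for c in coords) + "]"

def text_to_box(span, image_w, image_h, bins=1000):
    """Invert box_to_text: parse a generated span back into pixel coordinates."""
    nums = [int(t) for t in span.strip("[] ").split(",")]
    sx, sy = image_w / (bins - 1), image_h / (bins - 1)
    return (nums[0] * sx, nums[1] * sy, nums[2] * sx, nums[3] * sy)

# Example: build a grounded training target such as "The dog is at [312, 395, 687, 895]."
print(box_to_text((200, 190, 440, 430), 640, 480))  # -> [312, 395, 687, 895]
```

During training, these spans can simply be interleaved with the caption text, so the model learns to both consume and produce coordinates without any architectural changes.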
Combining features from multiple vision encoders with different inductive biases into a single compact visual representation can yield state-of-the-art performance across a wide range of captioning and visual question answering tasks, while also substantially improving robustness to visual hallucinations and out-of-distribution inputs.
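One common way to realize this kind of fusion is to project each encoder's tokens to a shared width, concatenate, and mix them with a small MLP; the sketch below assumes two frozen encoders (e.g. a CLIP-style and a DINO-style backbone) and is an illustration rather than the paper's implementation:

```python
# Sketch: fuse features from two vision encoders with different inductive biases
# into one compact token stream for a downstream vision-language head.
import torch
import torch.nn as nn

class MultiEncoderFusion(nn.Module):
    def __init__(self, dim_a, dim_b, fused_dim):
        super().__init__()
        # Project each encoder's features to a shared width, then mix them.
        self.proj_a = nn.Linear(dim_a, fused_dim)
        self.proj_b = nn.Linear(dim_b, fused_dim)
        self.mix = nn.Sequential(
            nn.LayerNorm(2 * fused_dim),
            nn.Linear(2 * fused_dim, fused_dim),
            nn.GELU(),
            nn.Linear(fused_dim, fused_dim),
        )

    def forward(self, feats_a, feats_b):
        # feats_a: (batch, tokens, dim_a), feats_b: (batch, tokens, dim_b)
        fused = torch.cat([self.proj_a(feats_a), self.proj_b(feats_b)], dim=-1)
        return self.mix(fused)

# Usage with placeholder feature shapes:
fusion = MultiEncoderFusion(dim_a=1024, dim_b=768, fused_dim=512)
tokens = fusion(torch.randn(2, 256, 1024), torch.randn(2, 256, 768))
print(tokens.shape)  # torch.Size([2, 256, 512])
```

Keeping the fused dimension small is what makes the representation compact enough to feed to the language model without inflating its context.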
Our framework leverages vision-language pre-trained models to generate scene graphs with both known and novel visual relation concepts, outperforming previous methods on open-vocabulary scene graph generation benchmarks.
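A minimal sketch of the open-vocabulary ingredient this relies on: scoring subject-object pair features against text embeddings of relation prompts, so novel relation concepts can be added at inference time simply as new prompts. The function names and the prompt-matching setup here are assumptions for illustration, not the framework's actual pipeline:

```python
# Sketch: zero-shot relation classification by matching visual pair features
# against text embeddings of relation prompts from a vision-language model.
import torch
import torch.nn.functional as F

def classify_relations(pair_features, relation_text_embeddings, temperature=0.07):
    """pair_features: (num_pairs, dim) visual features for subject-object pairs.
    relation_text_embeddings: (num_relations, dim) embeddings of prompts such as
    "a photo of a person riding a horse". Returns per-pair relation probabilities."""
    pair = F.normalize(pair_features, dim=-1)
    text = F.normalize(relation_text_embeddings, dim=-1)
    logits = pair @ text.t() / temperature
    return logits.softmax(dim=-1)

# Adding a novel relation at test time only requires encoding one more prompt:
probs = classify_relations(torch.randn(4, 512), torch.randn(10, 512))
print(probs.shape)  # torch.Size([4, 10])
```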