Broadening the Visual Encoding Capabilities of Vision-Language Models
Combining features from multiple vision encoders with different inductive biases into a single, versatile, and compact visual representation can lead to state-of-the-art performance on a wide range of captioning and visual question answering tasks, while also significantly improving robustness to visual hallucinations and out-of-distribution inputs.
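To make the idea of fusing several encoders into one compact visual representation concrete, here is a minimal sketch, not the actual implementation described above: each encoder's patch features are projected to a shared width, concatenated, and then compressed into a fixed number of visual tokens via a small set of learnable queries. The class name, dimensions, and the query-based compression scheme are illustrative assumptions.

```python
# Minimal sketch (assumed design, not the paper's implementation) of fusing
# features from several vision encoders into a compact set of visual tokens.

import torch
import torch.nn as nn


class MultiEncoderFusion(nn.Module):
    """Projects per-encoder patch features to a shared width, concatenates them
    along the token axis, and compresses the result into a fixed number of
    visual tokens that a language model can consume."""

    def __init__(self, encoder_dims, shared_dim=1024, num_tokens=64, lm_dim=4096):
        super().__init__()
        # One linear projection per encoder, mapping its native feature width
        # to the shared width so features from different encoders are comparable.
        self.projections = nn.ModuleList(
            [nn.Linear(d, shared_dim) for d in encoder_dims]
        )
        # Learnable query tokens that cross-attend to the concatenated features,
        # yielding a compact, fixed-size visual representation.
        self.queries = nn.Parameter(torch.randn(num_tokens, shared_dim))
        self.cross_attn = nn.MultiheadAttention(shared_dim, num_heads=8, batch_first=True)
        # Final projection into the language model's embedding space.
        self.to_lm = nn.Linear(shared_dim, lm_dim)

    def forward(self, encoder_features):
        # encoder_features: list of tensors, one per encoder,
        # each shaped (batch, num_patches_i, dim_i).
        projected = [proj(f) for proj, f in zip(self.projections, encoder_features)]
        fused = torch.cat(projected, dim=1)  # (batch, sum of num_patches_i, shared_dim)
        queries = self.queries.unsqueeze(0).expand(fused.size(0), -1, -1)
        tokens, _ = self.cross_attn(queries, fused, fused)  # (batch, num_tokens, shared_dim)
        return self.to_lm(tokens)  # (batch, num_tokens, lm_dim)


if __name__ == "__main__":
    # Dummy features standing in for, e.g., a contrastively trained encoder and
    # a self-supervised encoder with different patch counts and feature widths.
    feats = [torch.randn(2, 256, 768), torch.randn(2, 196, 1536)]
    fusion = MultiEncoderFusion(encoder_dims=[768, 1536])
    visual_tokens = fusion(feats)
    print(visual_tokens.shape)  # torch.Size([2, 64, 4096])
```

The key design point the sketch illustrates is that compression to a fixed token budget keeps the visual representation compact regardless of how many encoders are combined; alternatives such as channel-wise concatenation or simple pooling would serve the same role.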