Core Concepts
Combining features from multiple vision encoders with different biases into a versatile and compact visual representation can lead to state-of-the-art performance on a wide range of captioning and visual question answering tasks, while also significantly improving robustness against visual hallucinations and out-of-distribution inputs.
Abstract
The paper first conducts a comprehensive evaluation of several vision encoders whose inductive biases differ in training data, training objective, and model size, across a range of vision-language tasks. The results show that no single encoder consistently achieves top performance across tasks, and that encoders with different biases can perform surprisingly similarly.
Motivated by these findings, the authors introduce a method called BRAVE that consolidates features from multiple frozen vision encoders into a more versatile and compact visual representation. BRAVE uses a lightweight multi-encoder querying transformer (MEQ-Former) to efficiently resample the visual features from different encoders and feed them as a soft visual prompt to a frozen language model.
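To make the MEQ-Former idea concrete, below is a minimal PyTorch sketch of query-based resampling over multiple frozen encoders. The class name, dimensions, and the single cross-attention + feed-forward layer are illustrative assumptions for exposition, not the paper's exact architecture: learnable queries cross-attend to the concatenated tokens of all encoders and are projected into the language model's embedding space as a fixed-length soft visual prompt.

```python
# Hypothetical sketch of multi-encoder resampling in the spirit of BRAVE's MEQ-Former.
import torch
import torch.nn as nn

class MultiEncoderResampler(nn.Module):
    def __init__(self, encoder_dims, d_model=768, num_queries=32, lm_dim=4096, n_heads=8):
        super().__init__()
        # Learnable query tokens that become the fixed-length soft visual prompt.
        self.queries = nn.Parameter(torch.randn(num_queries, d_model) * 0.02)
        # Per-encoder linear projections map heterogeneous feature widths to a shared width.
        self.proj = nn.ModuleList([nn.Linear(d, d_model) for d in encoder_dims])
        # Queries cross-attend to the concatenated tokens of all frozen encoders.
        self.cross_attn = nn.MultiheadAttention(d_model, n_heads, batch_first=True)
        self.ffn = nn.Sequential(nn.Linear(d_model, 4 * d_model), nn.GELU(),
                                 nn.Linear(4 * d_model, d_model))
        self.norm_q = nn.LayerNorm(d_model)
        self.norm_kv = nn.LayerNorm(d_model)
        # Final projection into the frozen language model's embedding space.
        self.to_lm = nn.Linear(d_model, lm_dim)

    def forward(self, encoder_features):
        # encoder_features: list of [batch, tokens_i, dim_i] tensors from frozen vision encoders.
        kv = torch.cat([p(f) for p, f in zip(self.proj, encoder_features)], dim=1)
        q = self.queries.unsqueeze(0).expand(kv.size(0), -1, -1)
        attended, _ = self.cross_attn(self.norm_q(q), self.norm_kv(kv), self.norm_kv(kv))
        fused = attended + self.ffn(attended)
        return self.to_lm(fused)  # [batch, num_queries, lm_dim] soft visual prompt

# Example: three encoders with different token counts and widths are compressed
# into a single 32-token prompt for the language model.
feats = [torch.randn(2, 257, 1024), torch.randn(2, 196, 768), torch.randn(2, 576, 1152)]
prompt = MultiEncoderResampler([1024, 768, 1152])(feats)
print(prompt.shape)  # torch.Size([2, 32, 4096])
```

Because only the resampler (and its projections) would be trained while the vision encoders and language model stay frozen, this kind of design keeps the trainable parameter count small relative to the full model.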
BRAVE achieves state-of-the-art performance on a broad range of captioning and visual question answering benchmarks, including COCO, NoCaps, VQAv2, OKVQA, GQA, VizWiz-QA, MMVP, and POPE. It also significantly reduces the visual hallucinations and out-of-distribution failures that commonly plague vision-language models. Importantly, BRAVE achieves these improvements with fewer trainable parameters than existing methods.
The paper also provides a comprehensive ablation study to analyze the impact of different design choices in BRAVE, such as the contribution of individual vision encoders, the role of pre-training data, and the effectiveness of the MEQ-Former compared to a naive ensembling approach.
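For contrast with the resampling sketch above, a naive ensembling baseline can be pictured as simply concatenating every encoder's (projected) tokens and handing all of them to the language model, so the prompt length grows with each added encoder instead of staying fixed at the resampler's query count. The function below is a hypothetical illustration of that baseline, not the paper's exact ablation setup.

```python
import torch
import torch.nn as nn

def naive_ensemble(encoder_features: list[torch.Tensor], projections: nn.ModuleList) -> torch.Tensor:
    # Concatenate all projected encoder tokens along the sequence dimension;
    # the resulting visual prompt gets longer with every encoder added.
    return torch.cat([p(f) for p, f in zip(projections, encoder_features)], dim=1)
```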
Stats
"BRAVE uses a total of 10.3B parameters, with 116M trainable parameters during pre-training."
"BRAVE is pre-trained on the WebLI dataset, which contains 100 million image-text pairs."
Quotes
"Our results highlight the potential of incorporating different visual biases for a more broad and contextualized visual understanding of VLMs."
"BRAVE effectively consolidates diverse visual signals into a broad and contextual representation, leading to consistently better performance over the state-of-the-art and improved robustness against out-of-distribution inputs."