Core Concepts
Rigorous experimental analysis of key design choices in vision-language models, leading to the development of Idefics2 - an open, state-of-the-art 8B parameter VLM that outperforms larger models on various benchmarks.
Summary
The paper explores the design space of vision-language models (VLMs) through extensive experiments, focusing on two key areas: model architecture and multimodal training procedures.
Key Findings:
- For a fixed number of parameters, the quality of the language model backbone has a higher impact on the performance of the final VLM than the quality of the vision backbone.
- The fully autoregressive architecture outperforms the cross-attention architecture when the pre-trained backbones are unfrozen, even though the cross-attention architecture has more parameters.
- Under the fully autoregressive architecture, unfreezing the pre-trained backbones can lead to training divergences; training can be stabilized with Low-Rank Adaptation (LoRA) instead of full fine-tuning (see the LoRA sketch after this list).
- Reducing the number of visual tokens with a learned pooling module significantly improves compute efficiency at training and inference time, and also improves downstream performance (a learned-pooling sketch follows this list).
- Adapting a vision encoder pre-trained on fixed-size square images to preserve the original aspect ratio and resolution does not degrade performance while speeding up training and inference.
- Splitting images into sub-images during training allows trading compute efficiency for performance at inference time, which is particularly helpful for tasks that require reading text in images (an image-splitting sketch follows this list).
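The LoRA finding above can be made concrete with a minimal sketch using the Hugging Face `peft` library: the pre-trained backbone weights stay frozen and only small low-rank adapter matrices are trained. The backbone checkpoint, target modules, and hyperparameters below are illustrative assumptions, not the paper's exact configuration.

```python
# Sketch: stabilize fully autoregressive training by freezing the backbone
# and training low-rank adapters (LoRA) instead of all backbone weights.
import torch
from transformers import AutoModelForCausalLM
from peft import LoraConfig, get_peft_model

# Illustrative language-model backbone for the VLM (assumed checkpoint).
backbone = AutoModelForCausalLM.from_pretrained(
    "mistralai/Mistral-7B-v0.1", torch_dtype=torch.bfloat16
)

lora_config = LoraConfig(
    r=16,                     # rank of the low-rank update (illustrative)
    lora_alpha=32,            # scaling factor
    target_modules=["q_proj", "k_proj", "v_proj", "o_proj"],  # attention projections
    lora_dropout=0.05,
    bias="none",
    task_type="CAUSAL_LM",
)

# Wrap the backbone: the original weights stay frozen, and only the small
# low-rank matrices receive gradients during multimodal training.
model = get_peft_model(backbone, lora_config)
model.print_trainable_parameters()
```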
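The learned-pooling finding can be illustrated with a small PyTorch module in which a fixed set of learned query vectors cross-attends to the vision-encoder output, compressing hundreds of patch embeddings into a much shorter sequence. The dimensions and single-layer design here are assumptions for illustration; the paper uses a perceiver-resampler-style module.

```python
import torch
import torch.nn as nn

class LearnedPooling(nn.Module):
    """Compress a variable number of visual tokens into a fixed, smaller set
    by letting learned query vectors cross-attend to the vision-encoder output."""

    def __init__(self, dim: int = 1024, num_queries: int = 64, num_heads: int = 8):
        super().__init__()
        # Learned latent queries: one row per output visual token.
        self.queries = nn.Parameter(torch.randn(num_queries, dim) * 0.02)
        self.attn = nn.MultiheadAttention(dim, num_heads, batch_first=True)
        self.norm = nn.LayerNorm(dim)

    def forward(self, visual_tokens: torch.Tensor) -> torch.Tensor:
        # visual_tokens: (batch, seq_len, dim), e.g. hundreds of patch embeddings.
        batch = visual_tokens.size(0)
        q = self.queries.unsqueeze(0).expand(batch, -1, -1)
        pooled, _ = self.attn(q, visual_tokens, visual_tokens)
        # Output: (batch, num_queries, dim) -- a short sequence fed to the LM.
        return self.norm(pooled)

# Example: 576 patch embeddings compressed to 64 visual tokens.
pooler = LearnedPooling(dim=1024, num_queries=64)
out = pooler(torch.randn(2, 576, 1024))
print(out.shape)  # torch.Size([2, 64, 1024])
```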
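The image-splitting strategy in the last bullet can be sketched with Pillow as a simple grid split plus the full image as a global view. The grid size and the choice to keep the full image are assumptions for illustration, not the paper's exact recipe.

```python
from PIL import Image

def split_into_sub_images(image: Image.Image, rows: int = 2, cols: int = 2):
    """Split an image into a rows x cols grid of crops and append the
    full image, so the model sees both global and fine-grained views."""
    width, height = image.size
    tile_w, tile_h = width // cols, height // rows
    crops = []
    for r in range(rows):
        for c in range(cols):
            box = (c * tile_w, r * tile_h, (c + 1) * tile_w, (r + 1) * tile_h)
            crops.append(image.crop(box))
    crops.append(image)  # keep the full image as a global view alongside the crops
    return crops

# Each crop is encoded separately, multiplying the number of visual tokens
# (more compute) in exchange for better performance on text-reading tasks.
```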
Based on these insights, the authors train Idefics2 - an open 8B-parameter VLM that achieves state-of-the-art performance within its size category across various benchmarks while being more efficient at inference. On several challenging tasks, Idefics2 is on par with models four times its size.
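For readers who want to try the released model, a minimal inference sketch with the Hugging Face transformers library follows. The checkpoint name `HuggingFaceM4/idefics2-8b` is the published one; the chat-template usage and generation settings shown here follow the model card at release time and are illustrative rather than definitive.

```python
import torch
from PIL import Image
from transformers import AutoProcessor, AutoModelForVision2Seq

processor = AutoProcessor.from_pretrained("HuggingFaceM4/idefics2-8b")
model = AutoModelForVision2Seq.from_pretrained(
    "HuggingFaceM4/idefics2-8b", torch_dtype=torch.bfloat16, device_map="auto"
)

image = Image.open("example.jpg")  # placeholder path
messages = [
    {
        "role": "user",
        "content": [
            {"type": "image"},
            {"type": "text", "text": "What is written in this image?"},
        ],
    }
]
prompt = processor.apply_chat_template(messages, add_generation_prompt=True)
inputs = processor(text=prompt, images=[image], return_tensors="pt").to(model.device)

generated_ids = model.generate(**inputs, max_new_tokens=128)
print(processor.batch_decode(generated_ids, skip_special_tokens=True)[0])
```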