Core Concepts
The choice of connector in Multimodal Large Language Models (MLLMs) significantly affects performance: feature-preserving connectors excel at fine-grained perception tasks, while feature-compressing connectors offer substantial speed advantages and perform comparably on coarse-grained perception and reasoning tasks.
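To make the distinction concrete, below is a minimal sketch of the two connector families, assuming ViT patch features of shape (batch, tokens, vit_dim) and an LLM hidden size llm_dim; the class names, layer sizes, and pooling factor are illustrative, not the paper's exact configuration.

```python
import torch
import torch.nn as nn


class FeaturePreservingConnector(nn.Module):
    """Two-layer MLP projector: keeps every visual token, so the LLM
    sees the full patch grid (good for fine-grained perception)."""

    def __init__(self, vit_dim: int, llm_dim: int):
        super().__init__()
        self.proj = nn.Sequential(
            nn.Linear(vit_dim, llm_dim),
            nn.GELU(),
            nn.Linear(llm_dim, llm_dim),
        )

    def forward(self, x: torch.Tensor) -> torch.Tensor:
        # (B, N, vit_dim) -> (B, N, llm_dim): token count N is unchanged.
        return self.proj(x)


class FeatureCompressingConnector(nn.Module):
    """Pooling-based connector: merges neighboring tokens before
    projection, trading spatial detail for a shorter LLM input."""

    def __init__(self, vit_dim: int, llm_dim: int, pool: int = 2):
        super().__init__()
        self.pool = nn.AvgPool2d(pool)
        self.proj = nn.Linear(vit_dim, llm_dim)

    def forward(self, x: torch.Tensor) -> torch.Tensor:
        b, n, d = x.shape
        side = int(n ** 0.5)  # assume a square patch grid
        x = x.transpose(1, 2).reshape(b, d, side, side)
        x = self.pool(x)                  # (B, d, side/p, side/p)
        x = x.flatten(2).transpose(1, 2)  # (B, N/p^2, d)
        return self.proj(x)               # 4x fewer tokens for p=2
```

For a 336-pixel input with 14-pixel patches (576 visual tokens), the pooling variant above would pass only 144 tokens to the LLM, which is where the speed advantage of feature-compressing connectors comes from.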
Key Statistics
Increasing the image resolution from 224 to 336 enhances performance across all connector types for all three tasks.
Further increasing the resolution from 336 to 448 yields only marginal performance gains.
For feature-preserving connectors, increasing the resolution from 224 to 336 results in improvements of 12.6% in fine-grained perception, 2.5% in coarse-grained perception, and 2.3% in reasoning tasks.
For feature-compressing connectors, the corresponding improvements are 13.9%, 9.2%, and 4.3%.
When the resolution is further increased from 336 to 448, feature-preserving connectors change by +2.5%, +0.2%, and +0.6% on the same three tasks, while feature-compressing connectors change by -0.5%, -1.0%, and +0.9%.
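These resolution effects track the visual token count, which grows quadratically with resolution. As a back-of-envelope calculation (assuming a CLIP-style ViT with 14-pixel patches; the paper's exact encoder settings may differ):

```python
# Back-of-envelope token counts, assuming a ViT with 14-pixel patches
# (e.g., CLIP ViT-L/14); the actual encoder settings may differ.
PATCH = 14
for res in (224, 336, 448):
    print(f"{res}px -> {(res // PATCH) ** 2} visual tokens")
# 224px -> 256 visual tokens
# 336px -> 576 visual tokens
# 448px -> 1024 visual tokens
```

Moving from 336 to 448 nearly doubles the number of tokens the LLM must attend over, which helps explain why the gains flatten out, and even turn slightly negative for compressing connectors, at the highest resolution.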
C-Abstractor reduces training time by 80% in the pre-training stage and by 51% in the fine-tuning stage compared to a two-layer MLP at a resolution of 448.
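This speed-up is consistent with C-Abstractor's design: it downsamples the token grid with convolutions and adaptive pooling before handing a fixed, smaller token set to the LLM. The sketch below follows the general C-Abstractor recipe (ResNet-style blocks around an adaptive average pool); the block count, query count, and dimensions are assumptions, not the paper's exact configuration.

```python
import torch
import torch.nn as nn


class ResBlock(nn.Module):
    """Simple residual conv block (stand-in for the ResNet blocks
    used in C-Abstractor)."""

    def __init__(self, dim: int):
        super().__init__()
        self.body = nn.Sequential(
            nn.Conv2d(dim, dim, 3, padding=1),
            nn.GELU(),
            nn.Conv2d(dim, dim, 3, padding=1),
        )

    def forward(self, x: torch.Tensor) -> torch.Tensor:
        return x + self.body(x)


class CAbstractor(nn.Module):
    """Convolutional abstractor: conv blocks -> adaptive average pool
    to a fixed number of queries -> conv blocks -> projection."""

    def __init__(self, vit_dim: int, llm_dim: int,
                 num_queries: int = 144, depth: int = 3):
        super().__init__()
        side = int(num_queries ** 0.5)  # e.g. 144 queries -> 12x12 grid
        self.pre = nn.Sequential(*[ResBlock(vit_dim) for _ in range(depth)])
        self.pool = nn.AdaptiveAvgPool2d(side)
        self.post = nn.Sequential(*[ResBlock(vit_dim) for _ in range(depth)])
        self.proj = nn.Linear(vit_dim, llm_dim)

    def forward(self, x: torch.Tensor) -> torch.Tensor:
        b, n, d = x.shape
        side_in = int(n ** 0.5)  # assume a square patch grid
        x = x.transpose(1, 2).reshape(b, d, side_in, side_in)
        x = self.post(self.pool(self.pre(x)))
        x = x.flatten(2).transpose(1, 2)  # (B, num_queries, vit_dim)
        return self.proj(x)
```

At a resolution of 448 (1024 input tokens with 14-pixel patches), this reduces the LLM's visual input to a fixed 144 tokens, consistent with the reported pre-training and fine-tuning speed-ups over the token-preserving two-layer MLP.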
Quotes
"Our findings reveal that feature-preserving connectors excel in fine-grained perception tasks due to their ability to retain detailed visual information."
"In contrast, feature-compressing connectors, while less effective in fine-grained perception tasks, offer significant speed advantages and perform comparably in coarse-grained perception and reasoning tasks."
"These insights are crucial for guiding MLLM architecture design and advancing the optimization of MLLM architectures."