Advancing unified multimodal in-context learning for visual understanding tasks.
Improving Vision & Language Models through in-context learning (ICL) instruction tuning.
Textual information plays a crucial role in improving multimodal in-context learning performance, in both unsupervised and supervised retrieval of in-context examples.
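As a concrete illustration of what unsupervised, text-driven retrieval of in-context examples can look like, the sketch below ranks a candidate pool by caption similarity to the query. The TF-IDF representation, the field names, and the `select_demonstrations` helper are illustrative assumptions, not the method of any specific paper.

```python
# Minimal sketch: unsupervised, text-driven selection of in-context examples.
# Assumes each candidate carries a caption and a label; similarity is plain
# TF-IDF cosine over captions, a stand-in for any text embedding model.
from sklearn.feature_extraction.text import TfidfVectorizer
from sklearn.metrics.pairwise import cosine_similarity


def select_demonstrations(query_caption, candidate_pool, k=4):
    """Return the k candidates whose captions are most similar to the query."""
    captions = [c["caption"] for c in candidate_pool]
    matrix = TfidfVectorizer().fit_transform(captions + [query_caption])
    # The last row is the query; score it against every candidate caption.
    scores = cosine_similarity(matrix[-1], matrix[:-1]).ravel()
    ranked = sorted(range(len(candidate_pool)), key=lambda i: scores[i], reverse=True)
    return [candidate_pool[i] for i in ranked[:k]]


if __name__ == "__main__":
    pool = [
        {"caption": "a dog catching a frisbee in a park", "label": "dog"},
        {"caption": "a cat sleeping on a windowsill", "label": "cat"},
        {"caption": "a golden retriever running on grass", "label": "dog"},
    ]
    print(select_demonstrations("a dog running on the grass outside", pool, k=2))
```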
Multimodal in-context learning (M-ICL) primarily relies on text-driven mechanisms, with little to no influence from the image modality. Advanced M-ICL strategies such as RICES (Retrieval-based In-Context Example Selection) do not outperform a simple majority vote over the context examples.
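The majority-voting comparison can be made concrete in a few lines: given the context examples retrieved for a query, the baseline simply predicts their most frequent label, with no model call at all. The data layout below is an illustrative assumption.

```python
# Minimal sketch of the majority-voting baseline over context examples:
# predict the most frequent label among the retrieved demonstrations,
# without consulting the model at all.
from collections import Counter


def majority_vote_baseline(context_examples):
    """Predict the most common label among the in-context examples."""
    labels = [ex["label"] for ex in context_examples]
    return Counter(labels).most_common(1)[0][0]


if __name__ == "__main__":
    demos = [{"label": "dog"}, {"label": "dog"}, {"label": "cat"}]
    print(majority_vote_baseline(demos))  # -> "dog"
```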
Many-shot in-context learning significantly improves the performance of closed-weights multimodal foundation models, particularly Gemini 1.5 Pro, across diverse vision tasks, while open-weights models do not yet exhibit this capability.
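To make "many-shot" concrete, the sketch below assembles an interleaved prompt with hundreds of (image, label) demonstrations ahead of the query. The part schema is a generic assumption; each closed-weights API defines its own content format.

```python
# Minimal sketch: assembling a many-shot multimodal prompt as an interleaved
# sequence of image references and text parts. The part schema is a generic
# assumption, not the format of any specific API.
def build_many_shot_prompt(demonstrations, query_image, instruction):
    """Interleave (image, label) demonstrations ahead of the query image."""
    parts = [{"type": "text", "text": instruction}]
    for image_path, label in demonstrations:
        parts.append({"type": "image", "path": image_path})
        parts.append({"type": "text", "text": f"Answer: {label}"})
    parts.append({"type": "image", "path": query_image})
    parts.append({"type": "text", "text": "Answer:"})
    return parts


if __name__ == "__main__":
    demos = [(f"img_{i}.jpg", "dog" if i % 2 == 0 else "cat") for i in range(200)]
    prompt = build_many_shot_prompt(demos, "query.jpg", "Classify each image as dog or cat.")
    print(len(prompt))  # 1 instruction + 400 demo parts + 2 query parts = 403
```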
Multimodal Large Language Models (MLLMs) demonstrate varying reliance on visual and textual modalities during in-context learning, impacting performance across tasks and necessitating modality-aware demonstration selection strategies.
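One way to read "modality-aware demonstration selection" is as choosing the retrieval signal per task: rank candidates by image similarity when the task leans on the visual modality and by text similarity otherwise. The sketch below is a hypothetical heuristic; the reliance scores, embedding callables, and field names are all assumptions.

```python
# Minimal sketch of a modality-aware demonstration selection heuristic:
# retrieve by image similarity or text similarity depending on which
# modality the task is estimated to rely on. All inputs are assumptions.
import numpy as np


def cosine(a, b):
    return float(np.dot(a, b) / (np.linalg.norm(a) * np.linalg.norm(b)))


def select_demos(query, pool, modality_reliance, embed_image, embed_text, k=4):
    """Rank candidates by the similarity signal the task relies on most."""
    if modality_reliance["visual"] >= modality_reliance["textual"]:
        q = embed_image(query["image"])
        scores = [cosine(q, embed_image(c["image"])) for c in pool]
    else:
        q = embed_text(query["text"])
        scores = [cosine(q, embed_text(c["text"])) for c in pool]
    ranked = sorted(range(len(pool)), key=lambda i: scores[i], reverse=True)
    return [pool[i] for i in ranked[:k]]


if __name__ == "__main__":
    rng = np.random.default_rng(0)
    fake_embed = lambda _: rng.normal(size=8)  # stand-in for a real encoder
    pool = [{"image": f"img_{i}.jpg", "text": f"caption {i}"} for i in range(10)]
    query = {"image": "query.jpg", "text": "a query caption"}
    reliance = {"visual": 0.7, "textual": 0.3}  # e.g. estimated per task
    print(select_demos(query, pool, reliance, fake_embed, fake_embed, k=3))
```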