Advancing unified multimodal in-context learning for visual understanding tasks.
Improving vision-and-language models through in-context learning (ICL) instruction tuning.
Textual information plays a crucial role in the performance of multimodal in-context learning (M-ICL), in both unsupervised and supervised retrieval of in-context examples.
M-ICL primarily relies on text-driven mechanisms, with little to no influence from the image modality. Even advanced M-ICL strategies such as RICES (Retrieval-based In-Context Example Selection) do not outperform simple majority voting over the labels of the in-context examples; a sketch of this comparison follows.
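To make the comparison concrete, here is a minimal sketch under stated assumptions: candidate demonstrations come with precomputed image embeddings (e.g., CLIP features) and labels, `rices_select` ranks them by cosine similarity to the query image as RICES does, and `majority_vote` is the model-free baseline the finding says RICES fails to beat. All names and the random stand-in data are illustrative, not the papers' actual code.

```python
import numpy as np
from collections import Counter

def rices_select(query_emb: np.ndarray, candidate_embs: np.ndarray, k: int = 8) -> np.ndarray:
    """RICES-style selection: return indices of the k candidates whose
    embeddings are most cosine-similar to the query embedding."""
    q = query_emb / np.linalg.norm(query_emb)
    c = candidate_embs / np.linalg.norm(candidate_embs, axis=1, keepdims=True)
    return np.argsort(-(c @ q))[:k]

def majority_vote(labels: list[str]) -> str:
    """Baseline: predict the most frequent label among the selected
    in-context examples, ignoring the model entirely."""
    return Counter(labels).most_common(1)[0][0]

# Illustrative usage with hypothetical random embeddings (512-d, CLIP-like):
rng = np.random.default_rng(0)
pool_embs = rng.normal(size=(100, 512))
pool_labels = rng.choice(["cat", "dog"], size=100).tolist()
query_emb = rng.normal(size=512)

idx = rices_select(query_emb, pool_embs, k=8)
print(majority_vote([pool_labels[i] for i in idx]))
```

The point of the baseline is that it uses only the retrieved examples' labels and never queries the model, so matching it suggests the model's prediction is driven by the textual labels in the context rather than by the query image.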