Advancing unified multimodal in-context learning for visual understanding tasks.
Improving Vision & Language Models through in-context learning (ICL) instruction tuning.
Textual information plays a crucial role in improving multimodal in-context learning performance, in both unsupervised and supervised retrieval of in-context examples.
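As a concrete illustration of what unsupervised, text-driven retrieval of in-context examples can look like, the sketch below ranks a candidate pool by caption similarity to the query. The TF-IDF representation, the field names, and the `select_demonstrations` helper are illustrative assumptions, not the method of any specific paper.

```python
# Minimal sketch: unsupervised, text-driven selection of in-context examples.
# Assumes each candidate carries a caption and a label; similarity is plain
# TF-IDF cosine over captions, a stand-in for any text embedding model.
from sklearn.feature_extraction.text import TfidfVectorizer
from sklearn.metrics.pairwise import cosine_similarity


def select_demonstrations(query_caption, candidate_pool, k=4):
    """Return the k candidates whose captions are most similar to the query."""
    captions = [c["caption"] for c in candidate_pool]
    matrix = TfidfVectorizer().fit_transform(captions + [query_caption])
    # The last row is the query; score it against every candidate caption.
    scores = cosine_similarity(matrix[-1], matrix[:-1]).ravel()
    ranked = sorted(range(len(candidate_pool)), key=lambda i: scores[i], reverse=True)
    return [candidate_pool[i] for i in ranked[:k]]


if __name__ == "__main__":
    pool = [
        {"caption": "a dog catching a frisbee in a park", "label": "dog"},
        {"caption": "a cat sleeping on a windowsill", "label": "cat"},
        {"caption": "a golden retriever running on grass", "label": "dog"},
    ]
    print(select_demonstrations("a dog running on the grass outside", pool, k=2))
```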
Multimodal in-context learning (M-ICL) primarily relies on text-driven mechanisms, with little to no influence from the image modality. Advanced M-ICL strategies such as RICES (Retrieval-based In-Context Example Selection) do not outperform a simple majority vote over the context examples.
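The majority-voting comparison can be made concrete in a few lines: given the context examples retrieved for a query, the baseline simply predicts their most frequent label, with no model call at all. The data layout below is an illustrative assumption.

```python
# Minimal sketch of the majority-voting baseline over context examples:
# predict the most frequent label among the retrieved demonstrations,
# without consulting the model at all.
from collections import Counter


def majority_vote_baseline(context_examples):
    """Predict the most common label among the in-context examples."""
    labels = [ex["label"] for ex in context_examples]
    return Counter(labels).most_common(1)[0][0]


if __name__ == "__main__":
    demos = [{"label": "dog"}, {"label": "dog"}, {"label": "cat"}]
    print(majority_vote_baseline(demos))  # -> "dog"
```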
Many-shot in-context learning significantly improves the performance of closed-weights multimodal foundation models, particularly Gemini 1.5 Pro, across diverse vision tasks, while open-weights models do not yet exhibit this capability.
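To make "many-shot" concrete, the sketch below assembles an interleaved prompt with hundreds of (image, label) demonstrations ahead of the query. The part schema is a generic assumption; each closed-weights API defines its own content format.

```python
# Minimal sketch: assembling a many-shot multimodal prompt as an interleaved
# sequence of image references and text parts. The part schema is a generic
# assumption, not the format of any specific API.
def build_many_shot_prompt(demonstrations, query_image, instruction):
    """Interleave (image, label) demonstrations ahead of the query image."""
    parts = [{"type": "text", "text": instruction}]
    for image_path, label in demonstrations:
        parts.append({"type": "image", "path": image_path})
        parts.append({"type": "text", "text": f"Answer: {label}"})
    parts.append({"type": "image", "path": query_image})
    parts.append({"type": "text", "text": "Answer:"})
    return parts


if __name__ == "__main__":
    demos = [(f"img_{i}.jpg", "dog" if i % 2 == 0 else "cat") for i in range(200)]
    prompt = build_many_shot_prompt(demos, "query.jpg", "Classify each image as dog or cat.")
    print(len(prompt))  # 1 instruction + 400 demo parts + 2 query parts = 403
```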
Multimodal Large Language Models (MLLMs) demonstrate varying reliance on visual and textual modalities during in-context learning, impacting performance across tasks and necessitating modality-aware demonstration selection strategies.
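One way to read "modality-aware demonstration selection" is as choosing the retrieval signal per task: rank candidates by image similarity when the task leans on the visual modality and by text similarity otherwise. The sketch below is a hypothetical heuristic; the reliance scores, embedding callables, and field names are all assumptions.

```python
# Minimal sketch of a modality-aware demonstration selection heuristic:
# retrieve by image similarity or text similarity depending on which
# modality the task is estimated to rely on. All inputs are assumptions.
import numpy as np


def cosine(a, b):
    return float(np.dot(a, b) / (np.linalg.norm(a) * np.linalg.norm(b)))


def select_demos(query, pool, modality_reliance, embed_image, embed_text, k=4):
    """Rank candidates by the similarity signal the task relies on most."""
    if modality_reliance["visual"] >= modality_reliance["textual"]:
        q = embed_image(query["image"])
        scores = [cosine(q, embed_image(c["image"])) for c in pool]
    else:
        q = embed_text(query["text"])
        scores = [cosine(q, embed_text(c["text"])) for c in pool]
    ranked = sorted(range(len(pool)), key=lambda i: scores[i], reverse=True)
    return [pool[i] for i in ranked[:k]]


if __name__ == "__main__":
    rng = np.random.default_rng(0)
    fake_embed = lambda _: rng.normal(size=8)  # stand-in for a real encoder
    pool = [{"image": f"img_{i}.jpg", "text": f"caption {i}"} for i in range(10)]
    query = {"image": "query.jpg", "text": "a query caption"}
    reliance = {"visual": 0.7, "textual": 0.3}  # e.g. estimated per task
    print(select_demos(query, pool, reliance, fake_embed, fake_embed, k=3))
```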