Key Concept
Many-shot in-context learning significantly improves the performance of closed-weights multimodal foundation models, particularly Gemini 1.5 Pro, across diverse vision tasks, while open-weights models do not yet exhibit this capability.
Abstract
This research paper investigates the impact of many-shot in-context learning (ICL) on the performance of large multimodal models (LMMs). The authors benchmark three closed-weights LMMs (GPT-4o, GPT-4(V)-Turbo, and Gemini 1.5 Pro) and two open-weights LMMs (Llama 3.2-11B-Vision and InternLM-XComposer2.5) on 14 datasets spanning a variety of vision domains and tasks.
Research Objective
The study aims to determine whether providing LMMs with a large number of demonstrating examples at inference time, without updating any model parameters, improves their performance on a variety of vision tasks.
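To make the setup concrete, here is a minimal sketch of how a many-shot multimodal prompt might be assembled: demonstrating (image, label) pairs are interleaved in the prompt, followed by the query image. The message schema, `build_many_shot_prompt`, and `encode_image` are illustrative assumptions, not any specific vendor's API.

```python
import base64

def encode_image(path: str) -> str:
    """Base64-encode an image file for inline inclusion in a prompt."""
    with open(path, "rb") as f:
        return base64.b64encode(f.read()).decode("utf-8")

def build_many_shot_prompt(demos, query_image_path, class_names):
    """Interleave (image, label) demonstration pairs, then append the query.

    No model weights are updated -- all task information is carried
    in the prompt itself. `demos` is a list of (image_path, label) tuples
    and may contain hundreds to ~1,000+ pairs in the many-shot regime.
    """
    content = [{"type": "text",
                "text": "Classify each image into one of: "
                        + ", ".join(class_names) + "."}]
    for image_path, label in demos:
        content.append({"type": "image", "data": encode_image(image_path)})
        content.append({"type": "text", "text": f"Answer: {label}"})
    # The unlabeled query image comes last.
    content.append({"type": "image", "data": encode_image(query_image_path)})
    content.append({"type": "text", "text": "Answer:"})
    return [{"role": "user", "content": content}]
```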
Methodology
The researchers evaluate the models' performance using standard metrics like accuracy, macro-averaged F1, and mean Intersection over Union (IoU) for different tasks. They experiment with increasing numbers of demonstrating examples, up to approximately 2,000, to assess the impact of many-shot ICL. Additionally, they explore the effects of batching multiple queries in a single prompt to reduce inference cost and latency.
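As an illustration of the classification metrics, the sketch below uses scikit-learn's `accuracy_score` and macro-averaged `f1_score`; the label lists are hypothetical, and the mean-IoU metric used for localization-style tasks is omitted here.

```python
from sklearn.metrics import accuracy_score, f1_score

# Hypothetical ground-truth and predicted labels for one dataset.
y_true = ["melanoma", "nevus", "nevus", "melanoma", "bcc"]
y_pred = ["melanoma", "nevus", "melanoma", "melanoma", "bcc"]

print("accuracy:", accuracy_score(y_true, y_pred))
# Macro-averaging weights every class equally, which matters for
# class-imbalanced datasets such as HAM10000.
print("macro F1:", f1_score(y_true, y_pred, average="macro"))
```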
Key Findings
- Closed-weights LMMs, especially Gemini 1.5 Pro, demonstrate substantial performance improvements with many-shot ICL compared to few-shot and zero-shot settings.
- Gemini 1.5 Pro exhibits a log-linear performance improvement as the number of demonstrating examples increases on most datasets (see the fitting sketch after this list).
- Open-weights LMMs do not benefit from many-shot ICL, highlighting a performance gap compared to closed-weights models.
- Batching queries in many-shot ICL significantly reduces per-example latency and inference cost without compromising performance.
- Batching queries even improves zero-shot performance, potentially due to domain calibration, class calibration, and self-generated demonstrations.
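To illustrate what a log-linear trend means, the sketch below fits accuracy against the logarithm of the number of demonstrating examples; the (shots, accuracy) pairs are made-up numbers for illustration, not the paper's results.

```python
import numpy as np

# Hypothetical (shots, accuracy) pairs: a log-linear trend means accuracy
# grows roughly linearly in log(number of demonstrating examples).
shots = np.array([1, 10, 50, 200, 1000])
acc = np.array([0.42, 0.51, 0.58, 0.64, 0.71])

# Least-squares fit of acc ~ intercept + slope * ln(shots).
slope, intercept = np.polyfit(np.log(shots), acc, deg=1)
print(f"accuracy ~ {intercept:.3f} + {slope:.3f} * ln(shots)")
```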
Main Conclusions
Many-shot ICL is a promising approach to enhance the adaptability and performance of closed-weights LMMs on new tasks and domains without further training. The authors suggest that future research should focus on bridging the performance gap between open and closed-weights models in many-shot ICL.
Significance
This research contributes to the field of multimodal learning by demonstrating the potential of many-shot ICL for adapting large models to new tasks without fine-tuning. This capability has notable practical implications, making LMMs more versatile and accessible for a wide range of applications.
Limitations and Future Research
The study is limited by the context window size of current LMMs, restricting the number of demonstrating examples usable for tasks with many classes. Future research could explore techniques to overcome this limitation. Additionally, investigating the generalizability of these findings to other multimodal tasks and comparing many-shot ICL with fine-tuning in terms of performance and data efficiency are promising research avenues.
Statistics
Gemini 1.5 Pro performance improves log-linearly up to ~1,000 examples on 8 out of 14 datasets.
Gemini 1.5 Pro shows a performance increase of +23% accuracy on HAM10000 compared to zero-shot and +16% compared to 7 examples.
Gemini 1.5 Pro shows a performance increase of +29% accuracy on FIVES compared to zero-shot and +27% compared to 20 examples.
Gemini 1.5 Pro shows a performance increase of +38% accuracy on EuroSAT compared to zero-shot and +31% compared to 10 examples.
Both Gemini 1.5 Pro and GPT-4o achieve an average improvement of +17% accuracy at the optimal demo set size.
Batching up to 50 queries reduces per-example latency by nearly 35x and cost by 10x for HAM10000 with many-shot ICL.
Batching up to 50 queries reduces per-example latency by 20x and cost by 45x for TerraIncognita with many-shot ICL.
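These savings follow from amortizing the shared many-shot context across a batch of queries. A toy sketch of the input-token arithmetic, with hypothetical token counts (the actual figures depend on the dataset and provider pricing):

```python
# Toy amortization model: B queries share one many-shot context.
context_tokens = 100_000   # hypothetical size of the demonstration set
per_query_tokens = 500     # hypothetical tokens per batched query

def per_example_input_tokens(batch_size: int) -> float:
    """Input tokens attributable to each query when the context is shared."""
    return (context_tokens + batch_size * per_query_tokens) / batch_size

for B in (1, 10, 50):
    print(B, per_example_input_tokens(B))
# B=1  -> 100,500 tokens per query
# B=10 ->  10,500 tokens per query
# B=50 ->   2,500 tokens per query (~40x fewer input tokens per example)
```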
Quotes
"We show that providing close-weights multimodal foundation models with many demonstrating examples leads to substantial performance improvements compared to providing only a few demonstrating examples."
"We find open-weights multimodal foundation models like Llama 3.2-Vision and InternLM-XComposer2.5 do not benefit from the demonstrating examples, highlighting a significant gap and an important direction for the open-weights community."
"We demonstrate that batching multiple queries into a single request can achieve similar or better performance than single query requests in a many-shot setting, while enabling substantially lower per-example latency and much cheaper per-example inference cost."