Key Concepts
The authors identify the key challenges in multilingual visual reasoning and propose targeted interventions to address each of them.
Summary
The paper evaluates multilingual, multimodal models on visual reasoning tasks and identifies three key challenges: multilinguality, complex reasoning, and multimodality. The proposed interventions aim to improve open-model performance in a zero-shot setting, i.e., without task-specific finetuning.
The study compares proprietary systems like GPT-4V with open models like LLaVA, mBLIP, and CCLM on tasks that require reasoning jointly over a textual statement and a pair of images. GPT-4V significantly outperforms the open models, yet still lags behind human performance across languages.
Key findings include disparities in model performance across languages and cultures, underscoring the need for equitable system development. The analysis shows that open models require substantial advances to close the gap with proprietary systems.
The proposed interventions are: a translate-test approach for multilinguality (translating non-English inputs into English before inference), visual programming to decompose complex reasoning into simpler steps, and image captioning to reduce multimodal inputs to text. Together, these interventions improve open-model performance on visual reasoning tasks; a sketch of the resulting pipeline is shown below.
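To make the pipeline concrete, here is a minimal sketch combining the translate-test and captioning interventions, assuming the Hugging Face transformers library. The checkpoint names and the `verify` helper are illustrative stand-ins rather than the paper's exact setup, and the visual-programming step is simplified to a single text-reasoning call.

```python
from transformers import pipeline

# Translate-test: map a non-English statement into English before inference.
translator = pipeline("translation", model="Helsinki-NLP/opus-mt-mul-en")

# Captioning: reduce multimodality by turning each image into text.
captioner = pipeline("image-to-text", model="Salesforce/blip-image-captioning-base")

# Reasoning: a text-only LLM judges the statement against the two captions.
# (Model choice is an assumption; any capable instruction-tuned LLM works.)
reasoner = pipeline("text-generation", model="meta-llama/Llama-2-7b-chat-hf")

def verify(statement: str, image_left: str, image_right: str) -> str:
    """Judge a MaRVL-style example: is `statement` true of the image pair?"""
    english = translator(statement)[0]["translation_text"]
    cap_left = captioner(image_left)[0]["generated_text"]
    cap_right = captioner(image_right)[0]["generated_text"]
    prompt = (
        f"Left image: {cap_left}\n"
        f"Right image: {cap_right}\n"
        f"Statement: {english}\n"
        "Is the statement true of the image pair? Answer True or False."
    )
    return reasoner(prompt, max_new_tokens=5)[0]["generated_text"]
```

The design idea is that converting both the language and the images into English text lets a strong text-only model carry the reasoning load, which is the intuition behind the reported gains for open models.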
Statistics
GPT-4V achieves the best performance on MaRVL, with 82.1% accuracy.
After finetuning on NLVR2, mBLIP outperforms GPT-4V.
LLaVA's performance on MaRVL improves by 13.4% after the interventions.
Quotes
"Models have better visual reasoning capabilities with English inputs but lag behind with multilingual text."
"GPT-4V exhibits consistent performance across all languages, surpassing some results in English."
"Open models face challenges bridging the gap between proprietary systems in visual reasoning tasks."