Core Concepts
The authors evaluate the performance of state-of-the-art MLLMs on the NLVR challenge, highlighting their poor performance on spatial and compositional reasoning tasks.
Abstract
The study assesses GPT-4V, Gemini Pro, and IDEFICS on the NLVR task, revealing subpar performance driven by difficulties with spatial and compositional reasoning. Despite various prompting approaches, all models remain far below human accuracy. Fine-tuning the open-source IDEFICS model improves performance but still leaves room for improvement.
Stats
GPT-4V Zero-shot: 59.9% accuracy with 2195 TP, 1141 FN, 1239 FP, 1365 TN
Gemini Pro Zero-shot: 49.9% accuracy with 659 TP, 2677 FN, 300 FP, 2304 TN
IDEFICS Zero-shot: 55.9% accuracy with 3271 TP, 65 FN, 2555 FP, 49 TN
GPT-4V Five-shot: 58.0% accuracy with 2248 TP, 1088 FN, 1406 FP, 1198 TN
Gemini Pro Five-shot: 51.5% accuracy with 1363 TP, 1973 FN, 907 FP, 1697 TN
IDEFICS Five-shot: 45.1% accuracy with 738 TP, 2598 FN, 664 FP, 1940 TN
IDEFICS Fine-tuned: 59.7% accuracy with 2144 TP, 1200 FN, 1192 FP, 1404 TN
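The accuracies above follow directly from the confusion-matrix counts: accuracy = (TP + TN) / (TP + FN + FP + TN). A minimal Python sketch to recompute and check them (the dictionary layout is illustrative; the counts are the ones reported above, each summing to 5940 examples):

```python
# Confusion-matrix counts (TP, FN, FP, TN) as reported in the stats above.
results = {
    "GPT-4V zero-shot":     (2195, 1141, 1239, 1365),
    "Gemini Pro zero-shot": ( 659, 2677,  300, 2304),
    "IDEFICS zero-shot":    (3271,   65, 2555,   49),
    "GPT-4V five-shot":     (2248, 1088, 1406, 1198),
    "Gemini Pro five-shot": (1363, 1973,  907, 1697),
    "IDEFICS five-shot":    ( 738, 2598,  664, 1940),
    "IDEFICS fine-tuned":   (2144, 1200, 1192, 1404),
}

def accuracy(tp, fn, fp, tn):
    """Accuracy = correct predictions / all predictions."""
    return (tp + tn) / (tp + fn + fp + tn)

for name, (tp, fn, fp, tn) in results.items():
    print(f"{name}: {accuracy(tp, fn, fp, tn):.1%}")
```

Note the contrast the counts reveal beyond raw accuracy: zero-shot IDEFICS answers "true" almost indiscriminately (2555 FP vs. only 49 TN), while zero-shot Gemini Pro skews heavily toward "false" (2677 FN).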
Quotes
"We found that with only prompting models remain far from human accuracy."
"Fine-tuning the open-source IDEFICS model improved performance but there is still room for improvement."