Core Concepts
Vision Language Models (VLMs) struggle to apply mathematical reasoning consistently across visually similar problems that differ only in minor details, revealing a significant limitation in their robustness despite impressive average-case performance.
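To make the failure mode concrete: robustness here is probed by posing many variants of the same seed question that differ only in a rendered value (as in the GPT-4o quote below, where the parameter a is varied). The sketch below is a hypothetical illustration rather than DynaMath's actual generation code; the function y = a·x², the question text, and the file names are illustrative assumptions.

```python
# Minimal sketch (assumed, not DynaMath's implementation): a "seed question"
# written as a program that emits visually similar variants by varying one value.
import numpy as np
import matplotlib
matplotlib.use("Agg")  # render plots off-screen
import matplotlib.pyplot as plt

def make_variant(a: float, path: str) -> dict:
    """Render y = a*x^2 and return the image path, question, and ground truth."""
    x = np.linspace(-3, 3, 200)
    fig, ax = plt.subplots(figsize=(3, 3))
    ax.plot(x, a * x**2)
    ax.set_xlabel("x")
    ax.set_ylabel("y")
    fig.savefig(path)
    plt.close(fig)
    return {
        "image": path,
        "question": "Based on the plot, what is y when x = 2?",
        "answer": 4 * a,  # the same reasoning step solves every variant
    }

# Ten variants of one seed question that differ only in the value of a.
variants = [make_variant(a, f"variant_{a}.png") for a in range(1, 11)]
```

A robust model should answer all such variants correctly, since the underlying reasoning is identical.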
Stats
GPT-4o, Claude-3.5, and Gemini Pro 1.5 achieved average accuracies above 60% on the DynaMath benchmark, with Claude-3.5 reaching the highest at 64.8%.
Human performance on the same benchmark was 75.8%, indicating a significant performance gap.
The worst-case accuracy, which credits a seed question only when every variant is answered correctly, was substantially lower for all models, with Claude-3.5 scoring 35.3% and open-source models such as Qwen2-VL-72B scoring 28.3% (see the metric sketch after these stats).
GPT-4o, Gemini Pro 1.5, Qwen2-VL-72B, and InternVL2-76B exhibited consistent failure cases (the same wrong answer on every repetition, i.e., RC = 1) on 21.8%, 18.4%, 29.9%, and 28.3% of the seed questions, respectively.
Repetition consistency (RC) for the tested models ranged from 92% to 99%, indicating that the low robustness is not primarily due to random errors.
Error analysis of Claude 3.5 Sonnet revealed that figure reading errors (32.2%) and reasoning errors (26.9%) were the most common failure modes.
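For reference, the three metrics quoted above can be computed from evaluation logs roughly as follows. This is a sketch under an assumed data layout: the average-case and worst-case definitions follow the stats above (a seed question counts toward worst-case accuracy only if every variant is answered correctly), while the exact repetition-consistency formula used here (agreement with the modal answer over repeated runs) is an assumption rather than the benchmark's reference implementation.

```python
# Sketch of the three robustness metrics, computed from hypothetical evaluation
# logs. The data layout and the exact RC formula are assumptions.
from collections import Counter
from typing import Dict, List

# results[seed_id][variant_id] -> True if that variant was answered correctly
def average_case_accuracy(results: Dict[str, Dict[str, bool]]) -> float:
    flat = [ok for variants in results.values() for ok in variants.values()]
    return sum(flat) / len(flat)

def worst_case_accuracy(results: Dict[str, Dict[str, bool]]) -> float:
    # A seed question counts only if *every* one of its variants is correct.
    return sum(all(v.values()) for v in results.values()) / len(results)

# repeats[question_id] -> answers from repeated runs on the same variant
def repetition_consistency(repeats: Dict[str, List[str]]) -> float:
    # Assumed formula: average fraction of runs agreeing with the modal answer.
    per_question = []
    for answers in repeats.values():
        modal_count = Counter(answers).most_common(1)[0][1]
        per_question.append(modal_count / len(answers))
    return sum(per_question) / len(per_question)

if __name__ == "__main__":
    demo = {
        "seed_1": {"v1": True, "v2": True, "v3": True},
        "seed_2": {"v1": True, "v2": False, "v3": True},
    }
    print(average_case_accuracy(demo))  # 5/6: credit for each correct variant
    print(worst_case_accuracy(demo))    # 1/2: seed_2 fails on one variant
```

The drop from the 60%+ average accuracies to the roughly 30% worst-case accuracies quoted above is exactly the gap between the first two functions.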
Quotes
"While GPT-4o can give correct answers for some values of a, it consistently gives a wrong answer for many different values of a ≠ 0."
"Our evaluation results highlight the limited reasoning robustness of both open-source and closed-source models, underscoring the necessity for the community to address these limitations in future research."
"These examples highlight the unreliability of VLMs on mathematical reasoning tasks."