Core Concepts
Multimodal foundation models exhibit a consistent preference for textual representations over visual representations when solving the same problems, in contrast to the well-documented human preference for visual formats.
Abstract
The paper introduces IsoBench, a benchmark dataset for evaluating multimodal foundation models on problems with isomorphic representations (i.e., the same problem presented in different modalities such as text, images, and mathematical expressions). The key findings are:
Across various multimodal foundation models, including GPT-4, Claude, and Gemini, the models perform substantially better on text-only prompts than on image-based prompts, even when the two carry identical information. This contrasts with the documented human preference for visual over textual representations (a minimal sketch of the text-versus-image comparison appears after this list).
Averaged over all IsoBench problems, the performance gap between text and image representations can be as large as 28.7 percentage points (for Claude-3 Opus), with still larger gaps on individual subjects, suggesting that the multimodal fusion components of these models may not be fully leveraging the visual information.
The paper introduces two prompting techniques, IsoCombination and IsoScratchPad, which can improve model performance by considering combinations of, and translations between, different input representations. These techniques help bridge the performance gap between text and image inputs in certain settings (see the second sketch after this list).
IsoBench covers a broad range of domains, including mathematics, science, algorithms, and chess, with each example provided in multiple isomorphic representations. This allows for fine-grained diagnosis of model capabilities and limitations across different input modalities.
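To make the evaluation setup concrete, here is a minimal Python sketch of the kind of harness the benchmark implies: one problem carried in several isomorphic representations, scored separately per modality. The `IsoExample` schema and the `ask` callable are illustrative assumptions, not the dataset's actual interface.

```python
from dataclasses import dataclass
from typing import Callable, Dict, List

@dataclass
class IsoExample:
    """One problem with isomorphic representations (hypothetical schema)."""
    question: str                    # task instruction, shared across modalities
    representations: Dict[str, str]  # e.g. {"text": "...", "image": "fig.png"}
    answer: str                      # gold label

def accuracy_by_representation(
    examples: List[IsoExample],
    ask: Callable[[str, str, str], str],  # (modality, representation, question) -> prediction
) -> Dict[str, float]:
    """Score the same model separately on each representation of the same problems."""
    correct: Dict[str, int] = {}
    total: Dict[str, int] = {}
    for ex in examples:
        for modality, rep in ex.representations.items():
            pred = ask(modality, rep, ex.question)
            total[modality] = total.get(modality, 0) + 1
            correct[modality] = correct.get(modality, 0) + int(pred.strip() == ex.answer)
    return {m: correct[m] / total[m] for m in total}
```

Because only the representation varies while the problem set stays fixed, any accuracy gap between modalities is attributable to the representation itself rather than to differences in problem difficulty.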
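The two prompting techniques reduce to simple control flow around the model call. Below is a hedged sketch under the same assumed `ask` interface (standing in for any multimodal chat API); the paper's exact prompts differ.

```python
from typing import Callable, Dict

# (representations, instruction) -> model reply; an assumed interface.
Ask = Callable[[Dict[str, str], str], str]

def iso_combination(ask: Ask, reps: Dict[str, str], question: str) -> str:
    # IsoCombination: present several isomorphic representations together,
    # letting the model cross-check one modality against another.
    return ask(reps, question)

def iso_scratchpad(ask: Ask, image_rep: str, question: str) -> str:
    # IsoScratchPad, step 1: translate the visual input into text.
    transcript = ask(
        {"image": image_rep},
        "Transcribe this figure into an equivalent textual representation.",
    )
    # Step 2: solve the problem using the model's own textual translation.
    return ask({"text": transcript}, question)
```

The appeal of IsoScratchPad is that it only asks the model to be a competent image-to-text translator; the harder reasoning then happens in the modality where the model performs best.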
Stats
On the mathematics problems, GPT-4 Turbo performs 29.7 points worse when provided with images instead of text.
On the science problems, Claude-3 Opus performs 18.7 points worse when provided with images instead of text.
On the graph algorithm problems, GPT-4 Turbo performs 19.3 points worse on graph connectivity when provided with images instead of text.
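For clarity on units: the gaps above are differences in accuracy expressed in percentage points. A toy computation, using placeholder scores rather than IsoBench's actual numbers:

```python
# Placeholder accuracies (fractions), not the paper's reported results.
scores = {
    "mathematics": {"text": 0.90, "image": 0.60},
    "chess":       {"text": 0.80, "image": 0.65},
}
for task, s in scores.items():
    gap_points = (s["text"] - s["image"]) * 100  # percentage-point gap
    print(f"{task}: text minus image = {gap_points:.1f} points")
```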
Quotes
"Across various foundation models, we observe that on the same problem, models have a consistent preference towards textual representations."
"Most prominently, when evaluated on all IsoBench problems, Claude-3 Opus performs 28.7 points worse when provided with images instead of text; similarly, GPT-4 Turbo is 18.7 points worse and Gemini Pro is 14.9 points worse."