Benchmarking Multimodal Foundation Models on Isomorphic Representations
Multimodal foundation models exhibit a consistent preference for textual over visual representations when solving the same problems, in contrast to known human preferences.