Acquiring Linguistic Knowledge Efficiency in Multimodal Models
The authors investigate the impact of multimodal input on language models' data efficiency, finding that vision does not consistently improve language performance. Their study suggests that current multimodal pretraining methods may not benefit from richer learning signals.