Evaluating Multimodal Models' Ability to Generalize Compositionally in Sequential Activity Understanding
Multimodal models exhibit improved performance over text-only models in understanding and generating novel compositions of concepts derived from sequential multimodal inputs, highlighting the importance of leveraging multiple modalities for compositional generalization.