Core Concepts
Multimodal models outperform text-only models at understanding and predicting novel compositions of concepts drawn from sequential multimodal inputs, underscoring the importance of leveraging multiple modalities for compositional generalization.
Abstract
The study investigates whether multimodal models exhibit sequential compositional generalization: the ability to understand and make predictions about novel compositions of primitive elements drawn from sequential multimodal inputs.
The authors introduce the COMPACT dataset, which is carefully curated from the EPIC-KITCHENS-100 dataset to ensure that the individual concepts are consistently distributed across training and evaluation sets, while their compositions are novel in the evaluation set. This setup requires the models to exhibit systematic generalization when interpreting the evaluation set.
The authors benchmark several unimodal and multimodal models, spanning text-only, vision-language, audio-language, and tri-modal configurations, on two tasks: next utterance prediction and atom classification. The results show that bi-modal and tri-modal models hold a clear edge over their text-only counterparts, emphasizing the importance of multimodality for compositional generalization. However, all models struggle to master this challenge, indicating the formidable nature of the task.
Further analysis reveals that models perform significantly better on in-domain (non-compositional) data than on out-of-domain (compositional) data, highlighting the difficulty introduced by compositionality itself. While models can recognize individual concepts, they struggle to generalize to novel combinations of these primitives.
The authors conclude that the proposed COMPACT dataset and the associated tasks provide a valuable testbed for evaluating the compositional generalization capabilities of multimodal models, and they hope this work will stimulate further research in this direction.
Stats
"The training and evaluation sets have similar distributions of atomic concepts (verbs and nouns) but feature varied combinations of these concepts."
"The training and evaluation sets have an atom divergence (DA) < 0.02 and a compound divergence (DC) > 0.6, representing a sweet spot in terms of target distributions of atoms and compounds."
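The atom/compound divergence criterion quoted above can be illustrated with a small sketch. Below is a hypothetical implementation following the Distribution-Based Compositionality Assessment formulation commonly used for such splits (divergence via a Chernoff-style coefficient between normalized frequency distributions); the alpha values and the helper names are assumptions for illustration, not details confirmed by the paper.

```python
# Sketch: measuring atom vs. compound divergence between a train and an
# eval split. Atoms are individual verbs/nouns; compounds are their
# (verb, noun) combinations. Assumed formulation: D = 1 - sum_k p_k^a q_k^(1-a).
from collections import Counter

def divergence(p_counts, q_counts, alpha):
    """Divergence between two frequency distributions:
    1 minus the weighted Chernoff coefficient sum_k p_k^alpha * q_k^(1-alpha)."""
    keys = set(p_counts) | set(q_counts)
    p_total = sum(p_counts.values())
    q_total = sum(q_counts.values())
    coeff = sum(
        (p_counts.get(k, 0) / p_total) ** alpha
        * (q_counts.get(k, 0) / q_total) ** (1 - alpha)
        for k in keys
    )
    return 1.0 - coeff

def atoms(pairs):
    # Count individual verbs and nouns.
    return Counter(a for pair in pairs for a in pair)

def compounds(pairs):
    # Count (verb, noun) combinations.
    return Counter(pairs)

# Toy splits: the same atoms appear in both, but their combinations differ.
train = [("cut", "onion"), ("wash", "pan"), ("cut", "pepper")]
eval_ = [("wash", "onion"), ("cut", "pan"), ("wash", "pepper")]

# Atoms overlap heavily -> low atom divergence;
# compounds are disjoint -> maximal compound divergence.
d_atom = divergence(atoms(train), atoms(eval_), alpha=0.5)
d_comp = divergence(compounds(train), compounds(eval_), alpha=0.1)
```

On this toy data the atom divergence is near zero while the compound divergence is 1.0, mirroring the "sweet spot" the authors target (low D_A, high D_C).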
Quotes
"Humans possess a remarkable ability to rapidly understand new concepts by leveraging and combining prior knowledge. This compositional generalization allows for an understanding of complex inputs as a function of their constituent parts."
"Addressing the challenge of compositional generalization in the context of multimodal models is increasingly important with the recent advances in large multimodal foundation models, such as GPT-4, Flamingo, and IDEFICS."