Core Concepts
CoMM is a contrastive multimodal learning strategy that enables communication between modalities in a single multimodal space, allowing it to capture redundant, unique, and synergistic information between them.
Summary
The content discusses a novel contrastive multimodal learning strategy called CoMM that aims to capture multimodal interactions beyond just redundancy.
The key highlights are:
- Existing multimodal contrastive learning methods are limited to learning only the redundant information between modalities, as they rely on the multiview redundancy assumption.
- CoMM proposes a multimodal architecture with specialized encoders and an attention-based fusion module that produces a single multimodal representation (a sketch of such an architecture is given after this list).
- CoMM's training objective maximizes the mutual information between augmented versions of the multimodal features, which allows it to naturally capture redundant, unique, and synergistic information between modalities (a second sketch after the list illustrates such an objective).
- Theoretical analysis shows that the proposed formulation enables the estimation of these three types of multimodal interactions (the underlying decomposition is recalled after the sketches).
- Experiments on a controlled synthetic dataset and real-world multimodal benchmarks demonstrate that CoMM effectively learns all three forms of multimodal interactions and achieves state-of-the-art results.
- CoMM is a versatile framework that can handle any number of modalities, diverse data types, and different domains.
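As an illustration of the architecture bullet above, here is a minimal sketch in PyTorch; the class names (`AttentionFusion`, `MultimodalModel`) and all design details are assumptions for illustration, not the paper's actual code.

```python
# Illustrative sketch (not the paper's code): one specialized encoder per
# modality, plus an attention-based fusion module that merges the per-modality
# embeddings into a single multimodal representation.
import torch
import torch.nn as nn

class AttentionFusion(nn.Module):
    """Fuse a variable number of modality embeddings via self-attention."""
    def __init__(self, dim: int, num_heads: int = 4):
        super().__init__()
        self.fusion_token = nn.Parameter(torch.randn(1, 1, dim) * 0.02)
        self.attn = nn.TransformerEncoderLayer(dim, num_heads, batch_first=True)

    def forward(self, tokens: torch.Tensor) -> torch.Tensor:
        # tokens: (batch, n_modalities, dim)
        tok = self.fusion_token.expand(tokens.size(0), -1, -1)
        out = self.attn(torch.cat([tok, tokens], dim=1))
        return out[:, 0]  # the fusion token summarizes all modalities

class MultimodalModel(nn.Module):
    """One specialized encoder per modality, then attention-based fusion."""
    def __init__(self, encoders: list[nn.Module], dim: int):
        super().__init__()
        self.encoders = nn.ModuleList(encoders)
        self.fusion = AttentionFusion(dim)

    def forward(self, inputs: list[torch.Tensor]) -> torch.Tensor:
        # Each encoder maps its modality to a (batch, dim) embedding.
        tokens = torch.stack([enc(x) for enc, x in zip(self.encoders, inputs)], dim=1)
        return self.fusion(tokens)  # single multimodal representation
```

Because fusion is attention over a set of modality tokens, the same module accommodates two, three, or more modalities, consistent with the versatility claim above.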
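The training objective can likewise be sketched as an InfoNCE-style loss between two augmented multimodal views, whose maximization lower-bounds the mutual information between them. This is a simplification for illustration only: per the paper's description, the redundant, unique, and synergistic terms emerge from its full contrastive formulation, so the exact loss should be taken from the paper.

```python
# Hedged sketch of an InfoNCE loss between two augmented multimodal views.
import torch
import torch.nn.functional as F

def info_nce(z1: torch.Tensor, z2: torch.Tensor, temperature: float = 0.1) -> torch.Tensor:
    # z1, z2: (batch, dim) features from two augmentations of the same batch.
    z1, z2 = F.normalize(z1, dim=-1), F.normalize(z2, dim=-1)
    logits = z1 @ z2.t() / temperature                    # cosine similarities
    labels = torch.arange(z1.size(0), device=z1.device)   # positives on the diagonal
    return F.cross_entropy(logits, labels)

# Usage (illustrative; `model` and `augment` are assumed helpers):
#   z1 = model([augment(x) for x in batch])
#   z2 = model([augment(x) for x in batch])
#   loss = info_nce(z1, z2)
```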
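For reference on the terms above, redundancy, uniqueness, and synergy are standardly formalized via partial information decomposition (Williams & Beer, 2010); for two modalities X1, X2 and a task Y it reads as follows (standard background, not a result specific to CoMM).

```latex
% Partial information decomposition: the total task information splits into
% redundancy R, unique contributions U_1 and U_2, and synergy S.
I(X_1, X_2; Y) = R + U_1 + U_2 + S,
\qquad I(X_1; Y) = R + U_1,
\qquad I(X_2; Y) = R + U_2.
```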
Stats
"Humans perceive the world through multisensory integration, blending the information of different modalities to adapt their behavior."
"Contrastive learning offers an appealing solution for multimodal self-supervised learning."
"CoMM enables to model multimodal interactions –including redundancy, uniqueness and synergy– in the context of multimodal representation learning for the first time, as these terms naturally emerge from our contrastive multimodal formulation."
"CoMM achieves state-of-the-art results on seven multimodal tasks with two or three modalities."
Quotes
"Multimodal or multimodal learning (Baltruˇsaitis et al., 2018) involves extracting and processing information from multiple sources (or modalities, e.g. text, audio, images, tabular data, etc.) to perform a task."
"Modeling these interactions to perform multimodal learning is highly challenging as the interplay between R, S and U is task-dependent and difficult to measure in complex real-life scenarios."
"CoMM's formulation is well-aligned with the global workspace theory (Baars, 1988; Goyal & Bengio, 2022) in cognitive neuroscience, which considers the nervous system as a set of multiple specialized processors working in parallel and claims the existence of a shared representation, which can be modified by any selected processor and whose content is broadcast to all processors."