Core Concepts
This research proposes a novel regularization technique that encourages multimodal models to draw on information from all modalities when making decisions, mitigating the issues of modality failure and modality dominance.
Abstract
This work examines a persistent challenge in multimodal machine learning: unimodal models often outperform their multimodal counterparts, despite the latter having access to richer information. The key issues identified are:
Modality failure: the training process drives only one modality's encoder to its full potential, while the encoders of the other modalities remain suboptimal.
Modality dominance: multimodal models tend to rely overly on a single modality when making decisions, essentially ignoring the contributions of the others.
To address these challenges, the research uses attribution-based techniques to design a regularization term. This term is applied to the classifier and fusion components of the multimodal model, encouraging it to attend to information from all modalities when making decisions.
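The summary does not specify the exact formulation, but a minimal PyTorch sketch of one plausible instantiation might look as follows: gradient-times-input attribution is computed per modality, and the squared deviation of each modality's attribution share from an equal split is penalized. The function name, the gradient-times-input attribution choice, and the equal-share target are illustrative assumptions, not the paper's confirmed method.

```python
import torch
import torch.nn.functional as F

def attribution_balance_loss(logits, audio_feat, video_feat, labels, eps=1e-8):
    """Penalize unequal attribution between audio and video features.

    Attribution is approximated by gradient-times-input of the
    ground-truth class logit with respect to each modality's feature
    vector. Both feature tensors must be part of the autograd graph
    (i.e., outputs of the trainable encoders).
    """
    # Sum of ground-truth class logits over the batch (a scalar).
    target_logit = logits.gather(1, labels.unsqueeze(1)).sum()

    # create_graph=True keeps the regularizer itself differentiable,
    # so it can backpropagate into the fusion/classifier parameters.
    grad_a, grad_v = torch.autograd.grad(
        target_logit, (audio_feat, video_feat), create_graph=True
    )

    # Per-sample gradient-times-input attribution magnitude per modality.
    attr_a = (grad_a * audio_feat).abs().sum(dim=1)
    attr_v = (grad_v * video_feat).abs().sum(dim=1)

    # Normalized audio share; 0.5 means both modalities contribute equally.
    share_a = attr_a / (attr_a + attr_v + eps)
    return ((share_a - 0.5) ** 2).mean()

# Usage sketch: add the term, scaled by a coefficient lam, to the task loss.
# logits = classifier(fusion(audio_feat, video_feat))
# loss = F.cross_entropy(logits, labels) + lam * attribution_balance_loss(
#     logits, audio_feat, video_feat, labels)
```

In a sketch like this, the regularizer touches only the fusion and classifier gradients through the attribution computation, which is consistent with the paper's stated design of incorporating the term into those parts of the model.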
The proposed approach is evaluated on video classification tasks using the VGGSound and CREMA-D datasets. The results show that adding the regularization term yields little to no improvement under conventional evaluation metrics such as accuracy and mean Average Precision (mAP). The authors acknowledge, however, that these metrics alone may not adequately capture the regularizer's effect, and that further work is needed to develop evaluation techniques that can assess the benefits of the equal attribution it encourages.
The authors remain optimistic that additional evaluation metrics, together with replicated experiments on the CREMA-D dataset, will yield a fuller understanding of the impact and potential benefits of their regularization technique for multimodal machine learning.