Key concepts
Multimodal learning models can be made robust to missing modalities through a simple and parameter-efficient adaptation procedure that modulates the intermediate features of available modalities to compensate for the missing ones.
Summary
The key highlights and insights from the content are:
Multimodal learning (MML) models often suffer significant performance degradation when one or more input modalities are missing at test time. Existing approaches to address this issue either require specialized training strategies or additional models/sub-networks, which are not feasible for practical deployment.
The authors propose a parameter-efficient adaptation procedure to enhance the robustness of pretrained MML models to handle missing modalities. The adaptation is achieved by inserting lightweight, learnable layers that modulate the intermediate features of the available modalities to compensate for the missing ones.
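To make the idea concrete, here is a minimal sketch (not the authors' code) of what such a lightweight modulation layer could look like: a FiLM-style per-channel scale-and-shift applied to an intermediate feature map. The class name, initialization, and shapes are illustrative assumptions.

```python
import numpy as np

class ModulationAdapter:
    """Hypothetical lightweight adapter: per-channel scale-and-shift
    modulation of an intermediate (batch, channels, H, W) feature map."""

    def __init__(self, num_channels, seed=0):
        rng = np.random.default_rng(seed)
        # Learnable parameters, initialized near identity so the
        # pretrained model's behavior is preserved at the start of adaptation.
        self.gamma = 1.0 + 0.01 * rng.standard_normal(num_channels)
        self.beta = 0.01 * rng.standard_normal(num_channels)

    def num_params(self):
        return self.gamma.size + self.beta.size

    def __call__(self, features):
        # Broadcast the (C,) parameters over the (B, C, H, W) features.
        shape = (1, features.shape[1]) + (1,) * (features.ndim - 2)
        return self.gamma.reshape(shape) * features + self.beta.reshape(shape)

adapter = ModulationAdapter(num_channels=64)
x = np.random.default_rng(1).standard_normal((2, 64, 8, 8))
y = adapter(x)
print(y.shape)               # (2, 64, 8, 8): feature shape is unchanged
print(adapter.num_params())  # 128: tiny next to a multi-million-parameter backbone
```

Because the output shape matches the input, such a layer can be dropped between existing blocks of a frozen pretrained network, which is what makes the approach parameter-efficient.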
The adapted models demonstrate notable performance improvements over the original MML models when tested with missing modalities, and in many cases, outperform or match the performance of models trained specifically for each modality combination.
The adaptation approach is versatile and can be applied to a wide range of MML tasks and datasets, including semantic segmentation, material segmentation, action recognition, sentiment analysis, and classification.
The adaptation requires a very small number of additional learnable parameters (less than 1% of the total model parameters), making it computationally efficient and practical for real-world deployment.
Experiments show that the adapted models are able to learn better feature representations to handle missing modalities, as evidenced by higher cosine similarity between the features of the adapted model and the complete modality features compared to the pretrained model.
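As an illustration of how this comparison can be measured (not the authors' evaluation code), the snippet below computes cosine similarity between a stand-in for complete-modality features and two approximations of them; the synthetic vectors are assumptions for demonstration only.

```python
import numpy as np

def cosine_similarity(a, b):
    """Cosine similarity between two flattened feature tensors."""
    a, b = a.ravel(), b.ravel()
    return float(a @ b / (np.linalg.norm(a) * np.linalg.norm(b)))

rng = np.random.default_rng(0)
full = rng.standard_normal(256)                  # stand-in for complete-modality features
adapted = full + 0.1 * rng.standard_normal(256)  # adapted features stay close to full ones
pretrained = rng.standard_normal(256)            # unadapted features drift much further

print(cosine_similarity(full, adapted) > cosine_similarity(full, pretrained))  # True
```

A higher cosine similarity to the complete-modality features indicates the adapted model has learned to reconstruct, in feature space, information the missing modality would have provided.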
Overall, the proposed parameter-efficient adaptation approach offers a promising solution to enhance the robustness of MML models to missing modalities without the need for specialized training or additional models.
Statistics
The authors provide the following key statistics and figures to support their findings:
On the MFNet dataset for multimodal semantic segmentation, the adapted model shows a 15.41% and 1.51% improvement in performance compared to the pretrained model when Thermal and RGB are missing, respectively.
On the NYUDv2 dataset, the adapted model shows a 31.46% and 1.63% improvement in performance compared to the pretrained model when RGB and Depth are missing, respectively.
On the MCubeS dataset for multimodal material segmentation, the adapted model shows improvements ranging from 1.82% to 8.11% over the pretrained model across the different missing-modality scenarios.
On the NTU RGB+D dataset for multimodal action recognition, the adapted model outperforms the state-of-the-art ActionMAE and UMDR models by 7.03% when RGB is missing and by 1.06% when Depth is missing.
On the UPMC Food-101 dataset for multimodal classification, the adapted model outperforms the prompt-based approach by 1.29% on average.