Key Concept
Multimodal learning models often overfit to a dominant modality, which hurts overall performance. This paper introduces a Multi-Loss Balanced method that mitigates the problem by dynamically adjusting learning rates according to each modality's individual performance, improving accuracy across several datasets and fusion techniques.
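To make the idea of performance-based learning-rate adjustment concrete, here is a minimal sketch. It is not the paper's update rule, which this summary does not state. The function name `balance_learning_rates`, the `floor` parameter, and the ratio-based scaling are all assumptions chosen for illustration: the weakest modality keeps the base rate, and stronger modalities are slowed in proportion to their lead.

```python
def balance_learning_rates(base_lr, modality_scores, floor=0.1):
    """Return a per-modality learning rate (hypothetical balancing rule).

    modality_scores maps a modality name to a performance score in [0, 1],
    for example the accuracy of that modality's own classification head.
    The weakest modality keeps base_lr; a stronger modality gets
    base_lr * (weakest_score / its_score), never below floor * base_lr.
    """
    if not modality_scores:
        raise ValueError("need at least one modality")

    weakest = min(modality_scores.values())
    rates = {}
    for name, score in modality_scores.items():
        # Avoid division by zero; a modality scoring 0 cannot be dominant.
        ratio = 1.0 if score <= 0 else weakest / score
        # Clip the scale factor to [floor, 1.0].
        rates[name] = base_lr * max(floor, min(1.0, ratio))
    return rates


# Illustrative scores only, not results from the paper.
rates = balance_learning_rates(0.01, {"audio": 0.8, "video": 0.4})
```

In this example the dominant audio branch would train at half the base rate while the lagging video branch keeps the full rate. In a real training loop the scores would be recomputed periodically and the resulting rates written into each encoder's optimizer parameter group.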
Statistics
On CREMA-D, models with ResNet backbone encoders surpass the previous best by 1.9% to 12.4%.
Conformer backbone models deliver improvements ranging from 2.8% to 14.1% across different fusion methods on CREMA-D.
On AVE, improvements range from 2.7% to 7.7%.
On UCF101, gains reach up to 6.1%.