Core Concept
Multimodal models often overfit to a dominant modality, which hurts overall performance. This paper introduces a Multi-Loss Balanced method that mitigates the problem by dynamically adjusting learning rates according to each modality's performance, improving accuracy across several datasets and fusion techniques.
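To make the idea of per-modality learning-rate adjustment concrete, here is a minimal Python sketch. It is not the paper's formulation, which this summary does not give. It assumes that a modality's current loss stands in for its "performance" and that a lagging modality (higher loss) should receive a proportionally larger learning rate. The function name `balanced_learning_rates`, the parameters `min_scale` and `max_scale`, and the loss values are all hypothetical.

```python
def balanced_learning_rates(base_lr, modality_losses, min_scale=0.1, max_scale=10.0):
    """Return one learning rate per modality (illustrative rule, an assumption).

    Each modality's rate is base_lr scaled by its loss relative to the mean
    loss, so a modality that is lagging (higher loss) learns faster and a
    dominant one (lower loss) is slowed down.
    """
    if not modality_losses:
        raise ValueError("modality_losses must not be empty")

    mean_loss = sum(modality_losses.values()) / len(modality_losses)
    if mean_loss <= 0:
        # Nothing to balance against; fall back to the shared base rate.
        return {name: base_lr for name in modality_losses}

    rates = {}
    for name, loss in modality_losses.items():
        scale = loss / mean_loss
        # Clamp so that one extreme loss value cannot destabilize training.
        scale = max(min_scale, min(max_scale, scale))
        rates[name] = base_lr * scale
    return rates


# Hypothetical losses: audio is the dominant modality, video is lagging.
rates = balanced_learning_rates(0.01, {"audio": 0.5, "video": 1.5})
```

In this example the video encoder ends up with a larger learning rate than the audio encoder. In a real training loop these rates would be recomputed periodically and applied to each encoder's optimizer parameter group.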
Statistics
On CREMA-D, models with ResNet backbone encoders surpass the previous best by 1.9% to 12.4%.
Conformer backbone models deliver improvements ranging from 2.8% to 14.1% across different fusion methods on CREMA-D.
On AVE, improvements range from 2.7% to 7.7%.
On UCF101, gains reach up to 6.1%.