Multi-modal Semantic Understanding with Contrastive Cross-modal Feature Alignment
The paper proposes CLFA, an approach to multi-modal feature alignment that uses CLIP as a teacher model and contrastive learning to align text and image features. The authors report improved performance on multi-modal sarcasm detection and sentiment analysis tasks.
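The summary does not give the exact training objective, so the following is only a minimal sketch of one common form of contrastive alignment: an InfoNCE-style loss that pulls each student embedding toward the teacher embedding of the same sample and away from the teacher embeddings of other samples in the batch. The function names, the temperature of 0.07, and the toy vectors are illustrative assumptions and are not taken from the paper. In practice the teacher vectors would come from a frozen CLIP encoder.

```python
import math


def l2_normalize(v):
    """Scale a non-zero vector to unit length so dot products are cosine similarities."""
    norm = math.sqrt(sum(x * x for x in v))
    return [x / norm for x in v]


def info_nce(student, teacher, temperature=0.07):
    """InfoNCE-style alignment loss (hypothetical helper, not the paper's code).

    student[i] and teacher[i] are embeddings of the same sample; every
    teacher[j] with j != i acts as a negative for student[i].
    """
    s = [l2_normalize(v) for v in student]
    t = [l2_normalize(v) for v in teacher]
    total = 0.0
    for i, s_i in enumerate(s):
        # Cosine similarity of student i to every teacher vector, scaled by temperature.
        logits = [sum(a * b for a, b in zip(s_i, t_j)) / temperature for t_j in t]
        # Numerically stable log-sum-exp over the batch.
        m = max(logits)
        log_denom = m + math.log(sum(math.exp(z - m) for z in logits))
        # Negative log-probability of picking the matching teacher vector.
        total += log_denom - logits[i]
    return total / len(s)


# Toy batch: two samples with 2-dimensional embeddings (assumed values).
teacher = [[1.0, 0.0], [0.0, 1.0]]
aligned = [[0.9, 0.1], [0.1, 0.9]]   # each student vector is close to its own teacher
swapped = [[0.1, 0.9], [0.9, 0.1]]   # each student vector is close to the wrong teacher

loss_aligned = info_nce(aligned, teacher)
loss_swapped = info_nce(swapped, teacher)
```

Under these assumptions, `loss_aligned` is smaller than `loss_swapped`, which is the behaviour a contrastive alignment objective is meant to reward: minimising the loss moves the student features toward the teacher's embedding space.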