TelME: A Teacher-led Multimodal Fusion Network for Emotion Recognition in Conversations
TelME incorporates cross-modal knowledge distillation to transfer information from a powerful text-based teacher model to enhance the representations of weaker audio and visual modalities, and then fuses the multimodal features using an attention-based shifting approach to optimize emotion recognition.