A metric learning approach using Siamese Networks can efficiently model conversational context to achieve state-of-the-art performance on emotion recognition in dialogues.
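The idea can be illustrated with a minimal sketch (not the paper's exact architecture): a weight-shared Siamese encoder maps two context-aware utterance representations into a common metric space and is trained with a contrastive loss so that same-emotion pairs end up close and different-emotion pairs far apart. The encoder layout, feature dimension, and margin below are illustrative assumptions.

```python
import torch
import torch.nn as nn
import torch.nn.functional as F

class SiameseUtteranceEncoder(nn.Module):
    def __init__(self, input_dim=768, embed_dim=128):
        super().__init__()
        # Shared weights: both utterances in a pair pass through the same network.
        self.net = nn.Sequential(
            nn.Linear(input_dim, 256),
            nn.ReLU(),
            nn.Linear(256, embed_dim),
        )

    def forward(self, x_a, x_b):
        # x_a, x_b: (batch, input_dim) context-aware utterance features
        return self.net(x_a), self.net(x_b)

def contrastive_loss(z_a, z_b, same_label, margin=1.0):
    # Pull same-emotion pairs together, push different-emotion pairs apart.
    dist = F.pairwise_distance(z_a, z_b)
    pos = same_label * dist.pow(2)
    neg = (1 - same_label) * F.relu(margin - dist).pow(2)
    return (pos + neg).mean()

# Toy usage with random features standing in for utterance representations.
model = SiameseUtteranceEncoder()
x_a, x_b = torch.randn(8, 768), torch.randn(8, 768)
same = torch.randint(0, 2, (8,)).float()
z_a, z_b = model(x_a, x_b)
loss = contrastive_loss(z_a, z_b, same)
loss.backward()
```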
TelME incorporates cross-modal knowledge distillation, using a strong text-based teacher model to strengthen the representations of the weaker audio and visual modalities, and then combines the multimodal features with an attention-based shifting fusion for emotion recognition in conversations.
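The two components can be sketched as follows; this is a hedged illustration of the general techniques, not TelME's exact implementation. It shows (1) logit-level knowledge distillation from a frozen text teacher to an audio or visual student, and (2) an attention-based shift that adjusts the text representation using the non-verbal features before classification. The module sizes, gating scheme, and temperature are illustrative assumptions.

```python
import torch
import torch.nn as nn
import torch.nn.functional as F

def distillation_loss(student_logits, teacher_logits, T=2.0):
    # Soften both distributions and match the student to the (frozen) teacher.
    return F.kl_div(
        F.log_softmax(student_logits / T, dim=-1),
        F.softmax(teacher_logits / T, dim=-1),
        reduction="batchmean",
    ) * (T * T)

class AttentionShiftFusion(nn.Module):
    """Shift the text feature by an attention-weighted mix of audio/visual features."""
    def __init__(self, dim=256, num_classes=7):
        super().__init__()
        self.attn = nn.MultiheadAttention(dim, num_heads=4, batch_first=True)
        self.gate = nn.Linear(2 * dim, 1)
        self.classifier = nn.Linear(dim, num_classes)

    def forward(self, text_feat, audio_feat, visual_feat):
        # The text feature acts as the query; audio/visual features are keys and values.
        nonverbal = torch.stack([audio_feat, visual_feat], dim=1)      # (B, 2, dim)
        shift, _ = self.attn(text_feat.unsqueeze(1), nonverbal, nonverbal)
        shift = shift.squeeze(1)
        # Gate how strongly the non-verbal modalities shift the text representation.
        g = torch.sigmoid(self.gate(torch.cat([text_feat, shift], dim=-1)))
        fused = text_feat + g * shift
        return self.classifier(fused)

# Toy usage: distill into an audio student, then fuse modalities for classification.
B, dim, C = 8, 256, 7
teacher_logits = torch.randn(B, C)                       # from the frozen text teacher
audio_student_logits = torch.randn(B, C, requires_grad=True)
kd = distillation_loss(audio_student_logits, teacher_logits)

fusion = AttentionShiftFusion(dim, C)
logits = fusion(torch.randn(B, dim), torch.randn(B, dim), torch.randn(B, dim))
```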