Core Concepts
An effective fusion model for dimensional emotion recognition using recursive cross-modal attention.
Abstract
Multi-modal emotion recognition has been gaining attention in recent years.
Existing methods fail to effectively capture cross-modal relationships.
Recursive Cross-Modal Attention (RCMA) is proposed to address this gap.
Method:
Visual Network:
ResNet-50 is used to model facial expressions in the video frames.
A Temporal Convolutional Network (TCN) then captures the temporal dynamics across frames (a sketch follows below).
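A minimal sketch of such a visual branch in PyTorch, assuming per-frame ResNet-50 features followed by dilated 1-D convolutions as a stand-in for a full TCN stack; the class name, feature dimension, and layer choices are illustrative assumptions, not the paper's code.

```python
import torch
import torch.nn as nn
from torchvision.models import resnet50

class VisualBranch(nn.Module):
    """Hypothetical visual branch: ResNet-50 per frame, then a TCN-style stack."""
    def __init__(self, feat_dim=512):
        super().__init__()
        backbone = resnet50(weights=None)          # facial-expression backbone
        backbone.fc = nn.Identity()                # keep the 2048-d pooled features
        self.backbone = backbone
        # Two dilated conv layers as a stand-in for a full TCN.
        self.tcn = nn.Sequential(
            nn.Conv1d(2048, feat_dim, kernel_size=3, padding=2, dilation=2),
            nn.ReLU(),
            nn.Conv1d(feat_dim, feat_dim, kernel_size=3, padding=4, dilation=4),
            nn.ReLU(),
        )

    def forward(self, clips):                      # clips: (B, T, 3, H, W)
        b, t = clips.shape[:2]
        feats = self.backbone(clips.flatten(0, 1))     # (B*T, 2048)
        feats = feats.view(b, t, -1).transpose(1, 2)   # (B, 2048, T)
        return self.tcn(feats).transpose(1, 2)         # (B, T, feat_dim)
```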
Audio Network:
Spectrograms of the vocal signal are processed with a VGG architecture.
A TCN then models the temporal relationships in the vocal signal (a sketch follows below).
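A minimal sketch of the audio branch under the same assumptions: log-mel spectrograms (parameters assumed), a VGG-16 convolutional stack as the feature extractor, and a single Conv1d standing in for the TCN.

```python
import torch
import torch.nn as nn
import torchaudio
import torchvision

class AudioBranch(nn.Module):
    """Hypothetical audio branch: log-mel spectrogram -> VGG-16 convs -> TCN."""
    def __init__(self, feat_dim=512):
        super().__init__()
        self.melspec = torchaudio.transforms.MelSpectrogram(
            sample_rate=16000, n_mels=64)          # assumed parameters
        self.to_db = torchaudio.transforms.AmplitudeToDB()
        self.vgg = torchvision.models.vgg16(weights=None).features  # conv stack only
        self.tcn = nn.Conv1d(512, feat_dim, kernel_size=3, padding=1)

    def forward(self, wav):                        # wav: (B, num_samples)
        spec = self.to_db(self.melspec(wav))       # (B, 64, T); needs T >= 32
        spec = spec.unsqueeze(1).repeat(1, 3, 1, 1)   # tile to 3 channels for VGG
        fmap = self.vgg(spec)                      # (B, 512, 2, T // 32)
        seq = fmap.mean(dim=2)                     # pool away the mel axis
        return self.tcn(seq).transpose(1, 2)       # (B, T // 32, feat_dim)
```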
Text Network:
BERT features of the transcribed text are employed and refined with a TCN (a sketch follows below).
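A minimal sketch of the text branch, assuming frozen bert-base-uncased features from Hugging Face Transformers refined by a Conv1d standing in for the TCN; the model name and dimensions are assumptions.

```python
import torch
import torch.nn as nn
from transformers import BertModel, BertTokenizer

tokenizer = BertTokenizer.from_pretrained("bert-base-uncased")
bert = BertModel.from_pretrained("bert-base-uncased").eval()
tcn = nn.Conv1d(768, 512, kernel_size=3, padding=1)   # stand-in for a TCN stack

@torch.no_grad()
def text_features(utterances):
    """Return per-token features for a batch of transcribed utterances."""
    toks = tokenizer(utterances, padding=True, return_tensors="pt")
    hidden = bert(**toks).last_hidden_state            # (B, L, 768)
    return tcn(hidden.transpose(1, 2)).transpose(1, 2) # (B, L, 512)

print(text_features(["I am so happy today"]).shape)    # torch.Size([1, 7, 512])
```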
Recursive Cross-Modal Attention:
Inter-modal relationships are captured by computing cross-attention weights across the modalities.
The cross-modal attention is deployed recursively: attended features from one iteration are fed back as inputs to the next (a sketch follows below).
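A minimal sketch of the recursive idea for two modalities (audio and visual); the paper fuses audio, visual, and text, and its exact formulation may differ. The iteration count, head count, and residual update here are assumptions.

```python
import torch
import torch.nn as nn

class RecursiveCrossModalAttention(nn.Module):
    """Hypothetical two-modality RCMA block with residual recursive refinement."""
    def __init__(self, dim=512, heads=8, iterations=3):
        super().__init__()
        self.attn_av = nn.MultiheadAttention(dim, heads, batch_first=True)
        self.attn_va = nn.MultiheadAttention(dim, heads, batch_first=True)
        self.iterations = iterations

    def forward(self, a, v):                  # a, v: (B, T, dim)
        for _ in range(self.iterations):
            # Each modality attends to the other; the attended features are
            # fed back as the next iteration's inputs (the recursion), with a
            # residual connection for stability.
            a_att, _ = self.attn_av(query=a, key=v, value=v)
            v_att, _ = self.attn_va(query=v, key=a, value=a)
            a, v = a + a_att, v + v_att
        return a, v

# Toy usage: fuse aligned audio/visual sequences, then feed a prediction head.
rcma = RecursiveCrossModalAttention()
a, v = torch.randn(2, 30, 512), torch.randn(2, 30, 512)
a_f, v_f = rcma(a, v)
fused = torch.cat([a_f, v_f], dim=-1)         # (2, 30, 1024)
```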
Experimental Setup:
Dataset:
The Affwild2 dataset is used for validation.
The partitioning is subject-independent, so no subject appears in both the training and validation sets (a sketch follows below).
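A minimal sketch of subject-independent partitioning with scikit-learn's GroupShuffleSplit; the toy subject IDs are illustrative. Grouping by subject guarantees that no subject contributes frames to both splits.

```python
from sklearn.model_selection import GroupShuffleSplit

frames = list(range(10))                       # toy frame indices
subjects = [0, 0, 1, 1, 2, 2, 3, 3, 4, 4]      # toy subject ID per frame

splitter = GroupShuffleSplit(n_splits=1, test_size=0.2, random_state=0)
train_idx, val_idx = next(splitter.split(frames, groups=subjects))

# No subject may appear in both partitions.
assert not {subjects[i] for i in train_idx} & {subjects[i] for i in val_idx}
```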
Implementation Details:
Dropout regularization and data augmentation are applied to mitigate overfitting.
Results:
The proposed fusion model outperforms relevant state-of-the-art methods on the Affwild2 validation set.
References:
Previous works on affective behavior analysis are cited.
Stats
In total, there are 2,993,081 frames with 584 subjects, out of which 277 are male and 178 female. The annotations for valence and arousal are provided continuously in the range of [-1, 1].
The number of epochs is set to 100, and early stopping is used to obtain the weights of the best network (a sketch follows below).
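A minimal sketch of this training setup: up to 100 epochs with early stopping that keeps the best validation weights. The placeholder model, data, loss, and the patience value are assumptions.

```python
import copy
import torch
import torch.nn as nn

model = nn.Linear(8, 2)                        # placeholder for the fusion model
opt = torch.optim.Adam(model.parameters(), lr=1e-3)
x, y = torch.randn(64, 8), torch.randn(64, 2)  # placeholder data

best_loss, best_state, patience, bad = float("inf"), None, 10, 0
for epoch in range(100):                       # at most 100 epochs, as stated
    opt.zero_grad()
    loss = nn.functional.mse_loss(model(x), y) # stand-in training step
    loss.backward()
    opt.step()
    val_loss = nn.functional.mse_loss(model(x), y).item()  # stand-in validation
    if val_loss < best_loss:
        best_loss, bad = val_loss, 0
        best_state = copy.deepcopy(model.state_dict())     # snapshot best weights
    else:
        bad += 1
        if bad >= patience:
            break                              # stop early: no recent improvement
model.load_state_dict(best_state)              # restore the best network
```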
Quotes
"By deploying the cross-modal attention in a recursive fashion, we are able to achieve better results than that of the relevant methods on the validation set of Affwid2."
"The proposed fusion model can be further improved using the fusion of multiple A and V backbones either through feature-level or decision-level fusion similar to that of the winner of the challenge."