Core Concepts
This paper presents a multimodal fusion approach that leverages pre-trained model features to achieve strong performance on the Valence-Arousal (VA) Estimation and Expression (Expr) Recognition tasks of the Aff-Wild2 dataset.
Abstract
The authors propose a multimodal fusion approach that combines the advantages of pre-trained models and multimodal fusion techniques to address the tasks of Valence-Arousal (VA) Estimation and Expression (Expr) Recognition on the Aff-Wild2 dataset.
For the VA challenge, the authors apply modality-specific preprocessing and feature extraction to the audio, visual, and text streams. They then integrate the extracted features with multimodal fusion models such as the Multimodal Cyclic Translation Network (MCTN), the Memory Fusion Network (MFN), and an attention-based network. They also incorporate techniques such as pseudo-labeling and label smoothing to further improve performance.
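The attention-based fusion idea above can be illustrated with a minimal sketch: score each modality's feature vector, normalize the scores with a softmax, and take the weighted sum. This is not the authors' MCTN or MFN architecture; the scoring vector `w` here is a stand-in for a learned attention layer.

```python
import numpy as np

def softmax(x):
    e = np.exp(x - x.max())
    return e / e.sum()

def attention_fusion(features, w):
    """Attention-based fusion of per-modality feature vectors.

    features: dict of modality name -> feature vector (all of dim d)
    w:        (d,) scoring vector (stand-in for a learned attention layer)
    Returns the attention-weighted sum of the modality features.
    """
    names = sorted(features)
    F = np.stack([features[n] for n in names])  # (M, d) modality matrix
    scores = F @ w                              # one scalar score per modality
    weights = softmax(scores)                   # normalize across modalities
    return weights @ F                          # (d,) fused representation

rng = np.random.default_rng(0)
d = 8
feats = {"audio": rng.normal(size=d),
         "visual": rng.normal(size=d),
         "text": rng.normal(size=d)}
fused = attention_fusion(feats, rng.normal(size=d))
print(fused.shape)  # (8,)
```

Because the weights sum to one, the fused vector is a convex combination of the modality features, so a single dominant modality can be emphasized without discarding the others.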
For the Expr challenge, the authors use MobileNetV3 as the backbone and incorporate a Transformer encoder to learn robust temporal features. They employ a Residual Network to combine the backbone features and the Transformer encoder's output, which is then used for expression recognition.
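A minimal sketch of the residual combination step described above: per-frame backbone features and Transformer-encoder outputs are concatenated, projected, and added back to the backbone features via a skip connection. The projection matrices `W1`/`W2` and the exact wiring are illustrative assumptions, not the authors' published configuration.

```python
import numpy as np

def relu(x):
    return np.maximum(x, 0)

def residual_combine(backbone_feat, temporal_feat, W1, W2):
    """Residual combination of frame-level backbone features with
    temporal Transformer-encoder outputs (illustrative sketch).

    backbone_feat: (T, d) per-frame MobileNetV3-style features
    temporal_feat: (T, d) Transformer encoder outputs
    W1:            (2d, d) projection for the concatenated features
    W2:            (d, d) output projection
    """
    x = np.concatenate([backbone_feat, temporal_feat], axis=-1)  # (T, 2d)
    h = relu(x @ W1)                    # project the fused features
    return backbone_feat + h @ W2       # residual skip from the backbone

rng = np.random.default_rng(1)
T, d = 4, 8
out = residual_combine(rng.normal(size=(T, d)), rng.normal(size=(T, d)),
                       rng.normal(size=(2 * d, d)), rng.normal(size=(d, d)))
print(out.shape)  # (4, 8)
```

The skip connection lets the classifier fall back on the static backbone features when the temporal branch adds little signal.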
The authors' proposed methods significantly outperform the baseline systems on both the VA and Expr challenges, demonstrating the effectiveness of their approach in leveraging pre-trained model features and multimodal fusion techniques for affective behavior analysis in the wild.
Stats
The Aff-Wild2 dataset used in this study contains 594 videos with approximately 3 million frames for the VA challenge and 548 videos with around 2.7 million frames for the Expr challenge.
Quotes
"Our code are avalible on https://github.com/FulgenceWen/ABAW6th."
"To guide the model training, the concordance correlation coefficient (CCC) loss is employed for Valence-Arousal (VA) Estimation."
"We employ F1-loss as the loss function. Empirically, we find that F1-loss usually performs better than the standard cross-entropy loss."