In this study, the authors address a limitation of traditional facial expression recognition systems, which focus on six basic expressions, by targeting compound expressions: combinations of basic emotions that are crucial for understanding human affect in real-world scenarios. Because comprehensive training datasets for compound expressions are lacking, they propose a zero-shot approach that combines a pretrained visual language model with CNN classification networks.

The work was developed for the 6th ABAW Challenge, in which participants were provided with unlabeled data containing compound expressions. The challenge draws on the C-EXPR-DB database, whose videos are annotated with 12 compound expressions; of these, seven were targeted in the challenge. Integrating a large-scale visual language pretrained model (Claude3) significantly enhanced recognition: the model was used to annotate the unlabeled data, and CNN classifiers were then trained and fine-tuned on these generated labels.

Implementation details include data processing steps such as face detection and alignment with RetinaFace, and the use of several CNN backbones: MobileNetV2, ResNet152, DenseNet121, ResNet18, and DenseNet201. Performance was evaluated with the F1 score averaged across all seven compound expressions.
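The evaluation metric described above, the F1 score averaged over the seven compound expression classes (macro F1), can be sketched as follows. This is a minimal illustration, not the authors' code; the function name and the class-index encoding (0 through 6) are assumptions for the example.

```python
def macro_f1(y_true, y_pred, num_classes=7):
    """Macro F1: compute F1 per class, then average over all classes.

    y_true / y_pred are sequences of integer class indices. Classes that
    never appear contribute an F1 of 0.0, which penalizes missing classes.
    """
    f1_scores = []
    for c in range(num_classes):
        # Per-class counts treating class c as the positive label.
        tp = sum(1 for t, p in zip(y_true, y_pred) if t == c and p == c)
        fp = sum(1 for t, p in zip(y_true, y_pred) if t != c and p == c)
        fn = sum(1 for t, p in zip(y_true, y_pred) if t == c and p != c)
        precision = tp / (tp + fp) if (tp + fp) else 0.0
        recall = tp / (tp + fn) if (tp + fn) else 0.0
        denom = precision + recall
        f1_scores.append(2 * precision * recall / denom if denom else 0.0)
    return sum(f1_scores) / num_classes

# Toy example with two classes: each class gets F1 = 0.5, so macro F1 = 0.5.
print(macro_f1([0, 0, 1, 1], [0, 1, 0, 1], num_classes=2))  # → 0.5
```

In practice one would use a library routine such as scikit-learn's `f1_score` with `average="macro"`; the hand-rolled version above just makes the per-class averaging explicit.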
by Jiahe Wang, J... at arxiv.org, 03-19-2024
https://arxiv.org/pdf/2403.11450.pdf