In this study, the authors address the limitations of traditional facial expression recognition systems, which cover only six basic expressions, by targeting compound expressions: combinations of basic emotions that are crucial for understanding human emotion in real-world scenarios. Because comprehensive training datasets for compound expressions are lacking, they propose a zero-shot approach that pairs a pretrained visual language model with CNN networks.

The work was developed for the 6th ABAW Challenge, in which participants were given unlabeled data containing compound expressions. The challenge draws on the C-EXPR-DB database, whose videos are annotated with 12 compound expressions; the competition track targets seven of them. Integrating a large-scale pretrained visual language model (Claude3) substantially improved recognition.

The methodology proceeds in three steps: annotating the unlabeled data with labels generated by the visual language model, training CNN classification networks on those labels, and fine-tuning the networks. Data processing includes face detection and alignment with RetinaFace, and the CNN backbones include MobileNetV2, ResNet152, DenseNet121, ResNet18, and DenseNet201. Performance is evaluated with the F1 score across all seven compound expressions.
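The first methodology step, annotating unlabeled data with labels generated by the visual language model, requires mapping free-form model responses to the seven compound-expression categories. A minimal sketch of that mapping is below; the class list follows the seven compound expressions used in the ABAW CE track, while the parsing logic itself is an illustrative assumption, not the authors' exact prompt handling.

```python
# Hypothetical helper: turn a free-form visual language model (VLM)
# response into a hard label for one of the seven compound classes.
# The parsing strategy (substring match) is an assumption for illustration.

COMPOUND_CLASSES = [
    "Fearfully Surprised",
    "Happily Surprised",
    "Sadly Surprised",
    "Disgustedly Surprised",
    "Angrily Surprised",
    "Sadly Fearful",
    "Sadly Angry",
]

def parse_vlm_label(response: str):
    """Return the index of the compound class named in the response,
    or None if no known class name is mentioned."""
    text = response.lower()
    for idx, name in enumerate(COMPOUND_CLASSES):
        if name.lower() in text:
            return idx
    return None
```

Frames whose responses map to no known class would be left unlabeled rather than guessed, so that only confident pseudo-labels feed the CNN training stage.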
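The summary lists several CNN backbones (MobileNetV2, ResNet152, DenseNet121, ResNet18, DenseNet201) but does not state how their outputs are combined. A common choice, sketched below as an assumption, is to average per-class softmax probabilities across models and take the argmax.

```python
import math

def softmax(logits):
    """Numerically stable softmax over a list of raw scores."""
    m = max(logits)
    exps = [math.exp(x - m) for x in logits]
    s = sum(exps)
    return [e / s for e in exps]

def ensemble_predict(per_model_logits):
    """Average softmax probabilities over models, return the argmax class.
    `per_model_logits` is one logit vector per CNN backbone.
    Probability averaging is an assumed fusion rule, not confirmed by the text."""
    probs = [softmax(logits) for logits in per_model_logits]
    n_models = len(probs)
    n_classes = len(probs[0])
    avg = [sum(p[c] for p in probs) / n_models for c in range(n_classes)]
    return max(range(n_classes), key=avg.__getitem__)
```

Averaging probabilities rather than logits keeps models with different score scales from dominating the vote.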
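Evaluation is the F1 score across the seven compound expressions; assuming an unweighted (macro) average over classes, which is the usual ABAW convention, the metric can be sketched as:

```python
def macro_f1(y_true, y_pred, num_classes=7):
    """Unweighted mean of per-class F1 scores (macro F1).
    Uses the identity F1 = 2*TP / (2*TP + FP + FN) per class;
    classes with no support and no predictions contribute 0."""
    f1_scores = []
    for c in range(num_classes):
        tp = sum(t == c and p == c for t, p in zip(y_true, y_pred))
        fp = sum(t != c and p == c for t, p in zip(y_true, y_pred))
        fn = sum(t == c and p != c for t, p in zip(y_true, y_pred))
        denom = 2 * tp + fp + fn
        f1_scores.append(2 * tp / denom if denom else 0.0)
    return sum(f1_scores) / num_classes
```

Because the average is unweighted, rare compound expressions count as much as frequent ones, which rewards systems that do not collapse onto the majority classes.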
by Jiahe Wang, J... at arxiv.org, 03-19-2024
https://arxiv.org/pdf/2403.11450.pdf