Core Concepts
A novel contrastive learning framework that incorporates both self-supervised and supervised signals to enhance the learning of discriminative features for accurate facial action unit detection, addressing challenges such as class imbalance and noisy labels.
Abstract
The paper proposes a contrastive learning framework for facial action unit (AU) detection, which aims to learn discriminative feature representations by incorporating both self-supervised and supervised signals. The key highlights are:
Contrastive Learning Framework:
- The proposed framework, named AUNCE, replaces traditional pixel-level learning methods and enables lightweight model development for AU detection.
- It maximizes the agreement between an anchor and its positive samples (semantically similar) and minimizes agreement with its negative samples (semantically dissimilar).
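The summary does not spell out AUNCE's exact objective. As a point of reference, a minimal InfoNCE-style contrastive loss in PyTorch looks like the following; the function name `info_nce_loss` and the temperature value are illustrative assumptions, not the paper's implementation:

```python
import torch
import torch.nn.functional as F

def info_nce_loss(anchor, positive, negatives, temperature=0.1):
    """Minimal InfoNCE-style contrastive loss (illustrative sketch):
    pull the anchor toward its positive and push it away from
    negatives in embedding space.

    anchor:    (D,)   embedding of the anchor sample
    positive:  (D,)   embedding of a semantically similar sample
    negatives: (N, D) embeddings of semantically dissimilar samples
    """
    anchor = F.normalize(anchor, dim=-1)
    positive = F.normalize(positive, dim=-1)
    negatives = F.normalize(negatives, dim=-1)

    # Cosine similarities scaled by temperature
    pos_logit = (anchor @ positive) / temperature        # scalar
    neg_logits = (negatives @ anchor) / temperature      # (N,)

    # Softmax cross-entropy with the positive as the "correct class"
    logits = torch.cat([pos_logit.unsqueeze(0), neg_logits])  # (N+1,)
    return -F.log_softmax(logits, dim=0)[0]
```

Replacing pixel-level encoding with such an embedding-level objective is what permits a lightweight encoder: the model only needs features discriminative enough to separate positives from negatives, not to reconstruct pixels.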
Negative Sample Re-weighting Strategy:
- Addresses the class imbalance issue of each AU type by adjusting the gradient magnitude of minority and majority class samples during back-propagation.
- Facilitates the mining of hard negative samples by applying importance weights to negative samples based on their similarity to the anchor.
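The exact re-weighting scheme is not given in this summary. A common way to realize both ideas is to weight each negative by a softmax over its similarity to the anchor (so hard negatives receive larger gradients) and by the inverse frequency of its AU class; the sketch below is an assumption along those lines, not AUNCE's published formula:

```python
import torch
import torch.nn.functional as F

def reweight_negatives(anchor, negatives, neg_class_freq,
                       temperature=0.1, beta=1.0):
    """Illustrative negative re-weighting (assumed scheme).

    anchor:         (D,)   normalized anchor embedding
    negatives:      (N, D) normalized negative embeddings
    neg_class_freq: (N,)   occurrence rate of each negative's AU class
    """
    sims = negatives @ anchor                    # (N,) cosine similarities

    # Hard-negative mining: negatives more similar to the anchor get
    # larger importance weights, hence larger back-propagated gradients.
    hardness = torch.softmax(beta * sims, dim=0)

    # Class-imbalance correction: down-weight majority-class negatives,
    # up-weight minority-class ones.
    balance = 1.0 / neg_class_freq
    balance = balance / balance.sum()

    # Detach so the weights rescale gradients without being trained,
    # then renormalize to mean 1 to keep the loss scale stable.
    weights = (hardness * balance).detach()
    weights = weights * sims.numel() / weights.sum()

    return weights * sims / temperature          # re-weighted negative logits
```

These re-weighted logits would then stand in for `neg_logits` in the contrastive loss sketched above.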
Positive Sample Sampling Strategy:
- Tackles the issue of noisy and false AU labels by integrating noisy-label learning and self-supervised learning techniques.
- Selects the most representative positive samples by combining supervised signals from label smoothing and self-supervised signals from data augmentation.
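The precise sampling rule is likewise not specified here. One plausible reading is that label smoothing tempers trust in the annotation, and the positive is drawn from an augmented view of the anchor (self-supervised signal) or a same-label sample (supervised signal) in proportion to that trust. The helper names and mixing rule below are hypothetical:

```python
import torch
import torch.nn.functional as F

def smooth_label(label, eps=0.1):
    """Standard binary label smoothing: soften hard 0/1 AU labels so
    noisy or false annotations contribute weaker supervised signals."""
    return label * (1.0 - eps) + 0.5 * eps

def select_positive(aug_emb, same_label_emb, label):
    """Hypothetical positive selection combining both signals.

    aug_emb:        (D,) embedding of an augmented view of the anchor
                    (self-supervised positive)
    same_label_emb: (D,) embedding of another sample sharing the
                    anchor's AU label (supervised positive)
    label:          0/1 annotation for the AU in question
    """
    label = torch.as_tensor(label, dtype=torch.float32)
    s = smooth_label(label)

    # Confidence in the annotation: distance of the smoothed label
    # from 0.5, rescaled to [0, 1] (equals 1 - eps here).
    confidence = (s - 0.5).abs() * 2.0

    # Trust the supervised positive in proportion to label confidence;
    # fall back to the self-supervised positive otherwise.
    positive = confidence * same_label_emb + (1.0 - confidence) * aug_emb
    return F.normalize(positive, dim=-1)
```

An alternative reading is that the most representative positive is simply the candidate whose embedding lies closest to the anchor; either way, the supervised and self-supervised signals back each other up when labels are unreliable.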
The proposed AUNCE framework is evaluated on four widely used benchmark datasets (BP4D, DISFA, GFT, and Aff-Wild2), demonstrating superior performance over state-of-the-art AU detection methods.
Stats
The occurrence rates of different AUs in the training sets of the datasets vary significantly, indicating a prevalent class imbalance issue.
The BP4D dataset contains around 140,000 frames, DISFA has 130,788 frames, the GFT training set has 108,000 frames, and the Aff-Wild2 training set has 1,390,000 frames.
Quotes
"Facial action unit (AU) detection has long encountered the challenge of detecting subtle feature differences when AUs activate."
"Existing methods often rely on encoding pixel-level information of AUs, which not only encodes additional redundant information but also leads to increased model complexity and limited generalizability."
"The accuracy of AU detection is negatively impacted by the class imbalance issue of each AU type, and the presence of noisy and false AU labels."