Core Concepts
SMIRK faithfully reconstructs expressive 3D faces from monocular images by replacing traditional differentiable rendering with a neural rendering module and by augmenting the training data with diverse expressions.
Abstract
The paper introduces SMIRK, a novel method for accurate 3D facial expression reconstruction from monocular images. The key contributions are:
A neural rendering module that replaces traditional differentiable rendering, providing a stronger supervision signal for 3D geometry reconstruction.
An augmented expression cycle path that generates diverse expressions during training, including extreme, asymmetric, and subtle expressions, to improve the generalization of the model.
The authors identify two key limitations in existing methods: shortcomings in the self-supervised training formulation, and lack of expression diversity in the training data. To address these, SMIRK employs a neural rendering module that generates a face image from the predicted 3D geometry and sparse image pixels, enabling the model to focus solely on optimizing the geometry. Additionally, the augmented expression cycle path synthesizes novel expressions and enforces consistency between the generated and reconstructed expressions, promoting diverse and accurate expression reconstruction.
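The conditioning of the neural renderer on geometry plus sparse input pixels can be illustrated with a minimal sketch. This is not the paper's implementation; the function names, the 5% sampling fraction, and the channel-wise concatenation are all illustrative assumptions. The idea is that the generator sees the predicted geometry and only a few scattered colour samples, so it cannot simply copy the input image and must rely on the geometry being correct.

```python
import numpy as np

def sample_sparse_pixels(image, keep_fraction=0.05, rng=None):
    """Randomly keep a small fraction of input pixels and zero the rest.

    The sparse samples give the generator colour/texture cues while
    withholding enough of the image that geometry must carry the signal.
    """
    rng = np.random.default_rng(0) if rng is None else rng
    h, w, _ = image.shape
    mask = rng.random((h, w, 1)) < keep_fraction
    return image * mask, mask

def make_generator_input(rendered_geometry, input_image, keep_fraction=0.05):
    """Build the conditioning tensor for a (hypothetical) neural generator:
    the rendered predicted geometry, the sparsely sampled input pixels,
    and the sampling mask, stacked channel-wise."""
    sparse, mask = sample_sparse_pixels(input_image, keep_fraction)
    return np.concatenate(
        [rendered_geometry, sparse, mask.astype(np.float32)], axis=-1
    )

# Toy usage with random arrays standing in for a real render and image.
geom = np.random.rand(64, 64, 3).astype(np.float32)
img = np.random.rand(64, 64, 3).astype(np.float32)
cond = make_generator_input(geom, img)
print(cond.shape)  # (64, 64, 7)
```

With most input pixels masked out, a photometric loss on the generator's output pushes errors back onto the predicted geometry rather than onto appearance.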
Extensive experiments, including quantitative evaluations on emotion recognition, image reconstruction, and a perceptual user study, demonstrate that SMIRK achieves state-of-the-art performance in faithfully capturing a wide range of facial expressions, including challenging cases such as asymmetric and subtle expressions.
Stats
No specific numerical statistics are reproduced in this summary. The paper supports its key claims through qualitative comparisons, quantitative evaluations on emotion recognition and image reconstruction, and a perceptual user study.
Quotes
"SMIRK replaces the differentiable rendering with a neural rendering module that, given the rendered predicted mesh geometry, and sparsely sampled pixels of the input image, generates a face image."
"We leverage this while training with an expression consistency / augmentation loss. This renders a mesh of the input identity under a novel expression, renders an image with the generator, project the rendering through the encoder, and penalizes the difference between the augmented and the reconstructed expression parameters."