Core Concepts
Our method fuses a personalized 3D subject prior with per-frame information to produce temporally stable 3D videos that faithfully reconstruct the user's dynamic appearance, such as facial expressions and lighting.
Abstract
The paper proposes a novel triplane fusion method for reconstructing coherent 3D portrait videos from a monocular RGB video. The key insight is to leverage a personalized 3D subject prior extracted from a frontal reference image and fuse it with per-frame information, achieving temporal consistency while capturing the user's dynamic appearance.
The method first uses a pre-trained LP3D model to construct a personal triplane prior from a frontal image of the user. During video reconstruction, it lifts each input frame into a raw triplane, which is then undistorted and fused with the personal triplane prior. The undistorter module learns to correct view-dependent distortions in the raw triplane, while the fuser module densely aligns the undistorted triplane with the personal prior and fuses the two, preserving the dynamic lighting, expression, and posture information of the input frame.
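To make the lift-undistort-fuse loop concrete, here is a minimal PyTorch sketch. The module names (Undistorter, Fuser), their simple convolutional bodies, and the lp3d callable are illustrative assumptions, not the paper's actual architectures; in particular, the dense alignment inside the real fuser is more elaborate than a plain concatenation.

```python
import torch
import torch.nn as nn

class Undistorter(nn.Module):
    """Corrects view-dependent distortions in a raw per-frame triplane.

    A triplane is stored here as a (B, 96, H, W) tensor: 3 planes x 32
    channels stacked along the channel axis, as in EG3D-style pipelines.
    """
    def __init__(self, channels: int = 96):
        super().__init__()
        self.net = nn.Sequential(
            nn.Conv2d(channels, channels, 3, padding=1),
            nn.ReLU(inplace=True),
            nn.Conv2d(channels, channels, 3, padding=1),
        )

    def forward(self, raw_triplane: torch.Tensor) -> torch.Tensor:
        # Predict a residual correction rather than regressing the
        # triplane from scratch.
        return raw_triplane + self.net(raw_triplane)

class Fuser(nn.Module):
    """Fuses the undistorted per-frame triplane with the personal prior."""
    def __init__(self, channels: int = 96):
        super().__init__()
        self.net = nn.Sequential(
            nn.Conv2d(2 * channels, channels, 3, padding=1),
            nn.ReLU(inplace=True),
            nn.Conv2d(channels, channels, 3, padding=1),
        )

    def forward(self, undistorted: torch.Tensor, prior: torch.Tensor) -> torch.Tensor:
        # Concatenate along channels; the network learns, per location,
        # how much to trust the stable prior vs. the dynamic frame.
        return self.net(torch.cat([undistorted, prior], dim=1))

def reconstruct_video(frames, prior_triplane, lp3d, undistorter, fuser):
    """Per-frame loop: lift -> undistort -> fuse with the personal prior."""
    fused_triplanes = []
    for frame in frames:
        raw = lp3d(frame)                           # lift RGB frame to a raw triplane
        undistorted = undistorter(raw)              # remove view-dependent distortion
        fused = fuser(undistorted, prior_triplane)  # inject identity-stable prior
        fused_triplanes.append(fused)
    return fused_triplanes
```

The point the sketch preserves is that fusion happens in triplane space, so the per-location blend of prior and per-frame evidence is learned rather than hand-tuned.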
The method is trained using only synthetic data generated by a pre-trained 3D GAN, with carefully designed augmentations to account for shoulder motion and lighting changes. Evaluations on both in-studio and in-the-wild datasets demonstrate that the proposed method achieves state-of-the-art performance in both temporal consistency and reconstruction accuracy, outperforming recent single-view 3D reconstruction and reenactment methods.
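Since training relies only on synthetic pairs, the data loop is easy to sketch. The following is a hedged illustration, assuming an EG3D-style generator exposing a z_dim attribute and a render(z, cam) method; jitter_lighting and shift_shoulders are crude stand-ins for the paper's carefully designed lighting and shoulder-motion augmentations.

```python
import torch

def jitter_lighting(img: torch.Tensor, max_gain: float = 0.3) -> torch.Tensor:
    # Crude global gain jitter as a stand-in for relighting; assumes
    # pixel values in [0, 1].
    gain = 1.0 + (2 * torch.rand(1).item() - 1) * max_gain
    return (img * gain).clamp(0.0, 1.0)

def shift_shoulders(img: torch.Tensor, max_shift: int = 8) -> torch.Tensor:
    # Crude horizontal roll of the lower half of the image as a stand-in
    # for shoulder motion relative to the head.
    shift = torch.randint(-max_shift, max_shift + 1, (1,)).item()
    out = img.clone()
    h = img.shape[-2]
    out[..., h // 2 :, :] = torch.roll(img[..., h // 2 :, :], shifts=shift, dims=-1)
    return out

def sample_training_pair(gan, frontal_cam, random_cam):
    """Render one synthetic (reference, input-frame) pair from the 3D GAN.

    `gan.z_dim` and `gan.render(z, cam)` are placeholders for the actual
    generator interface.
    """
    z = torch.randn(1, gan.z_dim)           # random identity code
    reference = gan.render(z, frontal_cam)  # frontal reference image
    target = gan.render(z, random_cam)      # posed "input frame"

    # Augment only the input frame, so the model must take identity from
    # the prior but lighting and posture from the per-frame observation.
    target = jitter_lighting(target)
    target = shift_shoulders(target)
    return reference, target
```

Because both views are rendered from the same latent code, ground-truth correspondence between the reference and the augmented frame comes for free, which is what makes fully synthetic training viable.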
Stats
Our method achieves state-of-the-art performance in both temporal consistency and reconstruction accuracy, outperforming recent single-view 3D reconstruction and reenactment methods.
Quotes
"We recognize the need to maintain both coherent identity and dynamic per-frame appearance to enable the best possible realism."
"Our key insight to solving this problem is to employ a fusion-based approach to achieve all of these properties: the approach needs to leverage the stability and accuracy of a personalized 3D prior, and needs to fuse the prior with per-frame observations to capture the diverse deviations from the prior."