Core Concepts
S3Editor is a framework that addresses the core challenges of face video editing. It combines a self-training strategy, a semantic-disentangled architecture, and a sparse learning technique to improve identity preservation, editing faithfulness, and temporal consistency.
Abstract
The paper introduces S3Editor, a Sparse Semantic-Disentangled Self-Training framework for face video editing. The key contributions are:
Self-Training Strategy:
Addresses the scarcity of supervised paired data for face video editing.
Generates pseudo-edited facial representations by uniformly sampling from an editing attribute pool.
Designs objectives for identity preservation and editing faithfulness to semi-supervise the training process.
Enhances the generalization capabilities of existing models, leading to superior editing results.
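The sampling step above can be sketched as follows. The attribute names, the edit-strength range, and the additive edit-direction model are illustrative assumptions, not details from the paper:

```python
import random

# Hypothetical editing attribute pool (illustrative names, not from the paper).
ATTRIBUTE_POOL = ["smile", "age", "eyeglasses", "beard", "bangs"]

def sample_pseudo_edit(latent, edit_directions, rng=None):
    """Uniformly sample an attribute from the pool and apply its edit
    direction to a latent code, producing a pseudo-edited representation.
    In the full framework, such pairs are supervised by identity-preservation
    and editing-faithfulness objectives."""
    rng = rng or random.Random()
    attr = rng.choice(ATTRIBUTE_POOL)    # uniform sampling over the pool
    strength = rng.uniform(0.5, 1.5)     # edit magnitude (assumed range)
    edited = [z + strength * d for z, d in zip(latent, edit_directions[attr])]
    return attr, edited
```

Because the pairs are generated rather than collected, the approach sidesteps the lack of supervised paired face-video data.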
Semantic Disentangled Editing Architecture:
Classifies all potential edits into multiple clusters based on their semantic representations.
Establishes a learnable transformation specific to each cluster.
Dynamically activates the transformations based on the specific edit demand, enabling an adaptive editing framework.
Augments the model's capacity and complements the self-training strategies.
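A minimal sketch of the cluster routing, assuming edits are assigned to the nearest centroid by cosine similarity and each cluster holds a linear transformation (the routing rule and linear form are assumptions; the paper learns the cluster-specific transformations end to end):

```python
import numpy as np

def route_and_transform(edit_emb, centroids, transforms, latent):
    """Assign an edit to its closest semantic cluster and apply that
    cluster's transformation to the latent code.

    centroids:  (K, d) cluster centers over edit embeddings
    transforms: list of K (d, d) per-cluster matrices
    """
    sims = centroids @ edit_emb / (
        np.linalg.norm(centroids, axis=1) * np.linalg.norm(edit_emb) + 1e-8
    )
    k = int(np.argmax(sims))          # dynamically activate one cluster
    return k, transforms[k] @ latent
```

Routing each edit demand to its own transformation is what makes the architecture adaptive rather than one-size-fits-all.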
Sparse Learning to Avoid Over-Editing:
Partitions facial latent representations into multiple distinct regions.
Actively promotes region sparsity during the training process.
Enables the model to recognize and transform only the most pertinent facial areas for each specific edit.
Contributes to more precise editing and enhances the semantic disentangled architecture.
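The region-sparsity objective can be approximated with a group-lasso-style penalty over the edit residual; the equal-size partitioning and the exact penalty form are assumptions for illustration:

```python
import numpy as np

def region_sparsity_penalty(delta, num_regions):
    """Sum of per-region L2 norms of the edit residual delta = edited - original.
    Minimizing this drives entire regions of the latent toward zero change,
    so only the facial areas pertinent to a given edit are transformed."""
    regions = np.split(delta, num_regions)   # equal-size partitions (assumed)
    return float(sum(np.linalg.norm(r) for r in regions))
```

A residual concentrated in one region incurs the same penalty as its single-region norm, whereas spreading the same energy across regions costs more, which pushes the model toward localized edits.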
The proposed S3Editor framework is model-agnostic and compatible with various face video editing methods, such as the GAN-based Latent Transformer and the diffusion-based DiffVAE. Comprehensive qualitative and quantitative results demonstrate that S3Editor significantly improves identity preservation, editing faithfulness, and temporal consistency, while avoiding over-editing.
Stats
The face video dataset used for training and evaluation is VoxCeleb.
Each frame is resized to 256x256 resolution.
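The resize step can be sketched with Pillow; the crop/alignment pipeline VoxCeleb frames typically pass through is omitted, and the bicubic filter is an assumption since the paper does not specify one:

```python
from PIL import Image

def preprocess_frame(frame, size=256):
    """Resize a video frame to the 256x256 resolution used for
    training and evaluation. Resampling filter is an assumption."""
    return frame.resize((size, size), Image.BICUBIC)
```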