Key concepts
This paper introduces a multi-modal face generation framework that simultaneously controls the identity, expression, and background of the generated faces, and that supports fine-grained expression synthesis beyond the commonly used coarse expression labels.
Summary
The proposed framework takes three inputs: a text prompt describing the background, a selfie photo of the user, and a text describing the desired expression. It generates a face image that matches all three.
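As a purely illustrative sketch, the input triple could be modeled as a small record type. The class name `FaceGenerationRequest` and its field names are hypothetical and do not come from the paper; the validation rule (non-empty strings) is likewise an assumption.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class FaceGenerationRequest:
    """Hypothetical container for the three inputs the framework takes."""

    background_prompt: str  # text describing the background
    selfie_path: str        # path to the user's selfie (identity source)
    expression_text: str    # text describing the desired expression

    def __post_init__(self):
        # Assumed rule: all three inputs are required and must be non-empty.
        for name in ("background_prompt", "selfie_path", "expression_text"):
            if not getattr(self, name).strip():
                raise ValueError(f"{name} must be a non-empty string")


request = FaceGenerationRequest(
    background_prompt="a sunny beach at dusk",
    selfie_path="selfie.jpg",
    expression_text="a shy, lopsided smile",
)
```

Free-form expression text, rather than a fixed label such as "happy", is what leaves room for the fine-grained expression control described in this summary.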
The key technical contribution is a diffusion model, named DiffSFSR, that performs Simultaneous Face Swapping and Reenactment (SFSR). It lets the framework transfer the identity from the user's selfie and the expression from the text prompt to the generated face separately, while keeping the background attributes unchanged.
The paper introduces several designs in the DiffSFSR diffusion model to improve its controllability and image quality:
- Balancing identity and expression encoders to reduce the transfer of residual identity attributes
- Improved midpoint sampling to efficiently impose identity and expression constraints during training
- Explicitly conditioning the diffusion model on the background image during training to help recover face pose and lighting
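This summary does not specify how the improved midpoint sampling works. The sketch below therefore shows only the general idea such schemes build on: recovering an estimate of the clean image from a noisy intermediate sample with the standard DDPM identity, so that an identity or expression constraint can be evaluated on it. All function names are hypothetical, plain lists stand in for images and embeddings, and the cosine distance is an assumed example of a constraint, not the paper's loss.

```python
import math


def add_noise(x0, eps, alpha_bar_t):
    """Forward diffusion: x_t = sqrt(a) * x0 + sqrt(1 - a) * eps."""
    scale = math.sqrt(alpha_bar_t)
    noise_scale = math.sqrt(1.0 - alpha_bar_t)
    return [scale * x + noise_scale * e for x, e in zip(x0, eps)]


def estimate_x0(x_t, eps_pred, alpha_bar_t):
    """One-step clean-sample estimate from x_t and the predicted noise."""
    if not 0.0 < alpha_bar_t <= 1.0:
        raise ValueError("alpha_bar_t must be in (0, 1]")
    scale = math.sqrt(alpha_bar_t)
    noise_scale = math.sqrt(1.0 - alpha_bar_t)
    return [(x - noise_scale * e) / scale for x, e in zip(x_t, eps_pred)]


def cosine_distance(a, b):
    """Assumed constraint: 1 - cosine similarity between two embeddings."""
    dot = sum(x * y for x, y in zip(a, b))
    norm_a = math.sqrt(sum(x * x for x in a))
    norm_b = math.sqrt(sum(y * y for y in b))
    return 1.0 - dot / (norm_a * norm_b)


# Toy check: with a perfect noise prediction the estimate recovers x0.
x0 = [0.5, -0.25, 1.0]
eps = [0.1, -0.3, 0.2]
x_t = add_noise(x0, eps, alpha_bar_t=0.36)
x0_hat = estimate_x0(x_t, eps, alpha_bar_t=0.36)
```

In a real training loop the lists would be image tensors, `eps` would come from the denoising network, and the distance would be computed between embeddings from pretrained identity and expression encoders rather than on raw values.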
Extensive experiments demonstrate the framework's ability to generate high-quality faces with fine-grained expression control, outperforming state-of-the-art text-to-image, face swapping, and face reenactment methods.
Statistics
"The generated faces well match the inputted triples and exhibit fine-grained expression synthesis."
"Extensive experiments have demonstrated the controllability and scalability of the proposed framework, in comparison with state-of-the-art text-to-image, face swapping, and face reenactment methods."
Quotes
"To tackle these issues, this paper proposes a novel framework that can simultaneously control identity, expression, and background from multi-modal inputs."
"The technical core inside the proposed framework is a novel diffusion model that can conduct Simultaneous Face Swapping and Reenactment (SFSR)."
"We devise a novel diffusion model that can undertake the task of simultaneously face swapping and reenactment."