The paper proposes Parts2Whole, a unified reference framework for generating customized human portraits from multiple reference images. The key components are:
Semantic-Aware Appearance Encoder: Encodes each reference image, together with its textual label, into multi-scale feature maps that preserve both appearance details and spatial information (see the first sketch after this list).
Shared Self-Attention: Injects the reference features into the generation process by sharing keys and values in the self-attention layers between the appearance encoder and the denoising U-Net, so that each location in the target image can attend to all locations in the reference features (see the second sketch after this list).
Enhanced Mask-Guided Subject Selection: Extends the shared self-attention by incorporating subject masks on the reference images, enabling precise selection of the specified parts from each reference (also illustrated in the second sketch).
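To make the first component concrete, here is a minimal PyTorch sketch of a semantic-aware appearance encoder. It is an illustrative assumption rather than the paper's implementation: a small convolutional backbone stands in for the trainable U-Net copy, and the class name `SemanticAwareAppearanceEncoder`, the learned label embedding, and all channel sizes are hypothetical choices.

```python
# Illustrative sketch only; not the paper's actual architecture.
import torch
import torch.nn as nn

class SemanticAwareAppearanceEncoder(nn.Module):
    def __init__(self, num_labels: int = 8, base_channels: int = 64):
        super().__init__()
        # One embedding per textual part label ("hair", "face", "upper clothes", ...).
        self.label_embed = nn.Embedding(num_labels, base_channels)
        self.stem = nn.Conv2d(3, base_channels, 3, padding=1)
        # Three downsampling stages produce multi-scale feature maps.
        self.stages = nn.ModuleList([
            nn.Sequential(
                nn.Conv2d(base_channels * 2**i, base_channels * 2**(i + 1),
                          3, stride=2, padding=1),
                nn.SiLU(),
            )
            for i in range(3)
        ])

    def forward(self, ref_image: torch.Tensor, label_id: torch.Tensor):
        """ref_image: (B, 3, H, W); label_id: (B,) integer part labels."""
        h = self.stem(ref_image)
        # Broadcast the label embedding over all spatial locations so the
        # features carry both appearance details and part semantics.
        h = h + self.label_embed(label_id)[:, :, None, None]
        features = []
        for stage in self.stages:
            h = stage(h)
            features.append(h)  # keep every scale for later key/value sharing
        return features
```

Keeping one feature map per scale mirrors the "multi-scale" aspect of the description: each scale can later be matched with the U-Net layer of the same resolution.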
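The second and third components can be illustrated together: shared self-attention concatenates reference keys and values with the target's own, and the subject mask restricts which reference tokens are attendable. The sketch below is a simplified single-head version under assumed tensor layouts; the function name `shared_self_attention`, the flattened (B, N, C) token format, and `ref_mask` are assumptions, not details from the paper.

```python
# Illustrative single-head sketch of shared self-attention with
# mask-guided subject selection; layouts and names are assumptions.
import math
import torch
import torch.nn.functional as F

def shared_self_attention(q_proj, k_proj, v_proj,
                          target_tokens: torch.Tensor,
                          ref_tokens: torch.Tensor,
                          ref_mask: torch.Tensor) -> torch.Tensor:
    """
    target_tokens: (B, N_t, C) tokens from one denoising U-Net layer.
    ref_tokens:    (B, N_r, C) tokens from the appearance encoder layer
                   (all references concatenated along the token axis).
    ref_mask:      (B, N_r) bool, True where a token belongs to the subject.
    """
    q = q_proj(target_tokens)
    # Keys/values come from BOTH the target and the references, so every
    # target location can attend to every (subject) reference location.
    kv_tokens = torch.cat([target_tokens, ref_tokens], dim=1)
    k, v = k_proj(kv_tokens), v_proj(kv_tokens)

    scores = q @ k.transpose(-2, -1) / math.sqrt(q.shape[-1])  # (B, N_t, N_t+N_r)
    # Target tokens stay visible; reference tokens are visible only where
    # the subject mask selects them, which excludes background regions.
    full_mask = torch.cat(
        [torch.ones(target_tokens.shape[:2], dtype=torch.bool,
                    device=target_tokens.device), ref_mask], dim=1)
    scores = scores.masked_fill(~full_mask[:, None, :], float("-inf"))
    return F.softmax(scores, dim=-1) @ v

if __name__ == "__main__":
    B, N_t, N_r, C = 1, 16, 32, 64
    out = shared_self_attention(
        torch.nn.Linear(C, C), torch.nn.Linear(C, C), torch.nn.Linear(C, C),
        torch.randn(B, N_t, C), torch.randn(B, N_r, C),
        torch.rand(B, N_r) > 0.5)
    print(out.shape)  # torch.Size([1, 16, 64])
```

Masking at the attention-score level, rather than cropping the reference image, lets background tokens be dropped without discarding spatial context inside the subject region.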
The proposed framework demonstrates superior quality and controllability for human image generation compared with existing alternatives, including test-time fine-tuning methods and zero-shot reference-based methods, and it can generate high-fidelity human portraits from varying numbers and combinations of reference images.
Source: Zehuan Huang et al., arXiv, 2024-04-24. https://arxiv.org/pdf/2404.15267.pdf