The authors propose a framework called Fashion Style Editing (FaSE) for text-driven fashion style editing on human imagery. The key challenges in this task are the modeling complexity of whole human body imagery and the elusiveness of fashion concepts.
To address these challenges, the authors explore two directions to visually reinforce the guidance signal:
Textual Augmentation: They enrich the text prompt by querying a language model to generate detailed descriptions for the target fashion concept, making the context more illustrative.
Editing with Visual Reference: They construct a reference database of fashion images and retrieve relevant references to guide the model, leveraging the W+ space of the generator as a feature space that conveys fashion concepts more concretely than CLIP's embedding space.
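The two reinforcement directions above can be sketched in a few lines. This is a minimal illustration, not the paper's implementation: `query_language_model` is a hypothetical stub standing in for a real LLM call, and the W+ shapes, database size, and retrieval metric are illustrative assumptions.

```python
import numpy as np

# 1) Textual augmentation: expand a terse fashion concept into a richer,
#    more illustrative prompt before feeding it to the editor.
def query_language_model(concept: str) -> str:
    # Hypothetical stub; a real system would query an actual language model.
    cached = {
        "punk": "ripped black leather jacket with metal studs and tartan trousers",
    }
    return cached.get(concept, concept)

def augment_prompt(concept: str) -> str:
    return f"a person wearing {query_language_model(concept)}"

# 2) Visual reference: retrieve the nearest reference in W+ space.
#    Assumption: reference images were pre-inverted to W+ codes of shape
#    (NUM_LAYERS, DIM); values here are the usual StyleGAN-style sizes.
NUM_LAYERS, DIM = 18, 512

def retrieve_reference(query_w: np.ndarray, db_w: np.ndarray, k: int = 1) -> np.ndarray:
    # Rank pre-inverted reference codes by Euclidean distance to the query.
    dists = np.linalg.norm(db_w.reshape(len(db_w), -1) - query_w.reshape(1, -1), axis=1)
    return db_w[np.argsort(dists)[:k]]

rng = np.random.default_rng(0)
db = rng.standard_normal((100, NUM_LAYERS, DIM))          # 100 reference codes
query = db[42] + 0.01 * rng.standard_normal((NUM_LAYERS, DIM))
nearest = retrieve_reference(query, db)[0]                # recovers db[42]
```

The retrieved W+ code can then serve as a visual guidance signal for the edit, in place of (or alongside) the text embedding.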
The authors also investigate the hierarchical latent space structure of StyleGAN-Human and discover that the mid-level group controls garment shape, enabling targeted fashion editing operations.
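A targeted edit of this kind can be sketched as layer-grouped blending in W+. The specific index range for the "mid-level" group below is a hypothetical placeholder, not the range identified in the paper:

```python
import numpy as np

NUM_LAYERS, DIM = 18, 512       # illustrative StyleGAN-style W+ shape
MID_LAYERS = slice(4, 8)        # hypothetical mid-level group (garment shape)

def edit_garment_shape(source_w: np.ndarray, reference_w: np.ndarray,
                       strength: float = 1.0) -> np.ndarray:
    """Blend only the mid-level layers toward the reference code,
    leaving coarse (pose) and fine (texture) layers untouched."""
    edited = source_w.copy()
    edited[MID_LAYERS] = ((1.0 - strength) * source_w[MID_LAYERS]
                          + strength * reference_w[MID_LAYERS])
    return edited

src = np.zeros((NUM_LAYERS, DIM))
ref = np.ones((NUM_LAYERS, DIM))
out = edit_garment_shape(src, ref, strength=0.5)
```

Restricting the edit to one layer group is what makes the operation targeted: garment shape changes while identity, pose, and fine texture are preserved.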
Qualitative and quantitative evaluations demonstrate the effectiveness of the proposed FaSE framework in projecting diverse fashion styles onto human images, outperforming the baseline text-driven editing method.
Source: by Chaerin Kong... at arxiv.org, 04-03-2024
https://arxiv.org/pdf/2404.01984.pdf