The paper analyzes two types of overfitting in customized text-to-image (T2I) diffusion models: concept-agnostic overfitting and concept-specific overfitting.
Concept-agnostic overfitting undermines the non-customized generative capabilities of the foundational T2I model, while concept-specific overfitting confines the customized model to the limited modalities seen during training. To quantify these two effects, the authors introduce the "Latent Fisher divergence" and the "Wasserstein metric", respectively.
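To give intuition for the Wasserstein metric, here is a minimal sketch of the empirical 1-Wasserstein distance between two equal-size 1-D samples. This is only a toy illustration: the paper applies such a metric to diffusion-model distributions, not raw 1-D samples, and the function name is ours.

```python
import numpy as np

def wasserstein_1d(xs, ys):
    """Empirical 1-Wasserstein distance between two equal-size 1-D samples:
    the mean absolute difference of the sorted values (illustrative only)."""
    xs, ys = np.sort(np.asarray(xs, dtype=float)), np.sort(np.asarray(ys, dtype=float))
    return float(np.mean(np.abs(xs - ys)))

# Shifting a sample by 1 yields a distance of exactly 1.
print(wasserstein_1d([0.0, 1.0, 2.0], [1.0, 2.0, 3.0]))  # 1.0
```

A larger distance between the distribution of customized outputs and the training images would signal stronger concept-specific overfitting.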
To address these challenges, the authors propose Infusion, a T2I customization method that decouples the attention maps and value features in the cross-attention module. Infusion leverages the attention maps from the foundational T2I model to preserve its generative diversity, while learning a lightweight residual embedding to inject the customized concepts. This approach enables Infusion to achieve plug-and-play single-concept and multi-concept generation, outperforming state-of-the-art methods in both text alignment and conceptual fidelity.
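The decoupling described above can be sketched as follows. This is a hypothetical simplification, not Infusion's actual implementation: the attention map is computed from the frozen foundation model's query/key projections, while only the value path carries the customized concept features.

```python
import numpy as np

def softmax(x, axis=-1):
    e = np.exp(x - x.max(axis=axis, keepdims=True))
    return e / e.sum(axis=axis, keepdims=True)

def decoupled_cross_attention(q_frozen, k_frozen, v_custom):
    """Toy sketch of decoupled cross-attention: the attention map comes
    from the frozen foundation model's queries/keys, preserving its
    generative diversity, while the customized concept is injected only
    through the value features (function name and shapes are ours)."""
    d = q_frozen.shape[-1]
    attn = softmax(q_frozen @ k_frozen.T / np.sqrt(d))  # frozen attention map
    return attn @ v_custom                              # customized values

rng = np.random.default_rng(0)
q = rng.standard_normal((4, 8))   # 4 image tokens, dim 8 (frozen branch)
k = rng.standard_normal((6, 8))   # 6 text tokens (frozen branch)
v = rng.standard_normal((6, 8))   # customized value features
out = decoupled_cross_attention(q, k, v)
print(out.shape)  # (4, 8)
```

Because the customized parameters touch only the value path, they can be swapped in and out of a pretrained model, which is what enables the plug-and-play single- and multi-concept generation described above.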
Extensive experiments demonstrate Infusion's robust resistance to overfitting, maintaining high-quality customized generation even with prolonged training on limited data. The authors also conduct a user study, which confirms the general preference for Infusion's customization capabilities over other baselines.
Key insights from the source paper by Weili Zeng, Y..., arxiv.org, 04-23-2024: https://arxiv.org/pdf/2404.14007.pdf