Personalized image generation aims to render subjects in novel scenes, styles, and actions.
Diffusion-based methods have substantially advanced this task.
Existing Methods:
Fine-tuning-based methods require several images of the specified subject and a per-subject optimization step, which typically takes 10-30 minutes.
Tuning-free methods are trained on large-scale datasets and encode a reference image into embeddings for personalization, removing the need for per-subject optimization.
Proposed MM-Diff:
Integrates vision-augmented text embeddings and detail-rich subject embeddings into the diffusion model through cross-attention (see the sketches below).
Introduces cross-attention map constraints that enable multi-subject image generation without predefined inputs such as layouts.
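A minimal sketch of the conditioning idea above, assuming a PyTorch implementation: the UNet's latent tokens attend both to (vision-augmented) text embeddings and to detail-rich subject embeddings through separate key/value projections. Module and argument names here (DualConditionCrossAttention, subject_scale) are illustrative assumptions, not MM-Diff's actual API.

```python
import torch
import torch.nn as nn
import torch.nn.functional as F

class DualConditionCrossAttention(nn.Module):
    """Cross-attention that attends to text tokens and, additionally, to
    detail-rich subject (image) tokens, summing the two results."""

    def __init__(self, dim: int, text_dim: int, subject_dim: int, heads: int = 8):
        super().__init__()
        self.heads = heads
        self.to_q = nn.Linear(dim, dim, bias=False)
        # Separate key/value projections for text and subject tokens.
        self.to_k_text = nn.Linear(text_dim, dim, bias=False)
        self.to_v_text = nn.Linear(text_dim, dim, bias=False)
        self.to_k_subj = nn.Linear(subject_dim, dim, bias=False)
        self.to_v_subj = nn.Linear(subject_dim, dim, bias=False)
        self.to_out = nn.Linear(dim, dim)

    def _attend(self, q, k, v):
        b, n, d = q.shape
        h = self.heads
        q = q.view(b, n, h, d // h).transpose(1, 2)
        k = k.view(b, -1, h, d // h).transpose(1, 2)
        v = v.view(b, -1, h, d // h).transpose(1, 2)
        out = F.scaled_dot_product_attention(q, k, v)
        return out.transpose(1, 2).reshape(b, n, d)

    def forward(self, latent_tokens, text_embeds, subject_embeds, subject_scale=1.0):
        q = self.to_q(latent_tokens)
        # Standard text conditioning; the text embeddings are assumed to already
        # be "vision-augmented", i.e. fused with a global image feature.
        text_out = self._attend(q, self.to_k_text(text_embeds), self.to_v_text(text_embeds))
        # Extra attention over detail-rich subject tokens from the image encoder.
        subj_out = self._attend(q, self.to_k_subj(subject_embeds), self.to_v_subj(subject_embeds))
        return self.to_out(text_out + subject_scale * subj_out)

# Example: 64 latent tokens conditioned on 77 text tokens and 4 subject tokens.
attn = DualConditionCrossAttention(dim=320, text_dim=768, subject_dim=1024)
out = attn(torch.randn(2, 64, 320), torch.randn(2, 77, 768), torch.randn(2, 4, 1024))
```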
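The cross-attention map constraint can likewise be sketched as a training-time loss that keeps different subjects' attention maps spatially disentangled; the overlap-penalty form below is an assumption chosen for illustration, not MM-Diff's exact formulation.

```python
import torch

def attention_disentanglement_loss(attn_maps: torch.Tensor,
                                   subject_token_ids: list[list[int]]) -> torch.Tensor:
    """
    attn_maps: (batch, num_tokens, h*w) cross-attention probabilities from the UNet.
    subject_token_ids: per-subject lists of prompt-token indices
    (e.g. the tokens of "a man" and "a dog").
    """
    # Aggregate each subject's attention over its tokens into one spatial map.
    per_subject = []
    for ids in subject_token_ids:
        m = attn_maps[:, ids, :].mean(dim=1)           # (batch, h*w)
        m = m / (m.sum(dim=-1, keepdim=True) + 1e-6)   # normalize to a distribution
        per_subject.append(m)

    # Penalize spatial overlap between every pair of subject attention maps.
    loss = attn_maps.new_zeros(())
    for i in range(len(per_subject)):
        for j in range(i + 1, len(per_subject)):
            loss = loss + (per_subject[i] * per_subject[j]).sum(dim=-1).mean()
    return loss
```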
Experimental Results:
MM-Diff outperforms other leading methods in subject fidelity and text consistency across various test sets.
Key Quotes:
"Personalization is expensive, as these methods typically need 10-30 minutes to fine-tune the model for each new subject using specially crafted data."
"Extensive experiments demonstrate the superior performance of MM-Diff over other leading methods."