Personalized image generation aims to render subjects in novel scenes, styles, and actions.
Diffusion-based methods have advanced personalized image generation.
Existing Methods:
Fine-tuning-based methods require several images of the specified subject for model optimization.
Tuning-free methods train on large-scale datasets and encode any image into embeddings for personalization.
Proposed MM-Diff:
Integrates vision-augmented text embeddings and detail-rich subject embeddings into the diffusion model.
Introduces cross-attention map constraints for multi-subject image generation without predefined inputs.
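To make the idea of a cross-attention map constraint concrete, here is a minimal sketch of one plausible form: a penalty on the spatial overlap between the attention maps of different subjects. This is an illustrative assumption, not the loss defined in the MM-Diff paper, which this summary does not state. The function names (normalize, overlap_penalty) and the toy 2x2 maps are hypothetical.

```python
from itertools import combinations


def normalize(attn_map):
    """Scale a flattened attention map so its entries sum to 1."""
    total = sum(attn_map)
    if total == 0:
        raise ValueError("attention map has no mass")
    return [v / total for v in attn_map]


def overlap_penalty(subject_maps):
    """Sum of pairwise inner products between normalized subject maps.

    Assumed form of the constraint (not taken from the paper): the value
    is zero when no two subjects attend to the same spatial location and
    grows as their attention maps overlap.
    """
    maps = [normalize(m) for m in subject_maps]
    return sum(
        sum(a * b for a, b in zip(m1, m2))
        for m1, m2 in combinations(maps, 2)
    )


# Two hypothetical subjects over a flattened 2x2 latent grid.
disjoint = [[1.0, 1.0, 0.0, 0.0], [0.0, 0.0, 1.0, 1.0]]
entangled = [[1.0, 1.0, 0.0, 0.0], [1.0, 1.0, 0.0, 0.0]]

print(overlap_penalty(disjoint))
print(overlap_penalty(entangled))
```

Under this assumption, minimizing the penalty during training would push each subject's tokens toward separate image regions, which is one way attributes of different subjects could be kept from mixing without a predefined layout or mask as input.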
Experimental Results:
MM-Diff outperforms other leading methods in subject fidelity and text consistency across various test sets.
Source: arxiv.org, MM-Diff
Statistics:
"Personalization is expensive, as these methods typically need 10-30 minutes to fine-tune the model for each new subject using specially crafted data."
"Extensive experiments demonstrate the superior performance of MM-Diff over other leading methods."