Core Concepts
SELMA introduces a novel paradigm to enhance the faithfulness of Text-to-Image models by fine-tuning on auto-generated, multi-skill datasets with skill-specific expert learning and merging.
Abstract:
Recent text-to-image (T2I) models struggle to generate images that faithfully follow the semantics of text prompts.
SELMA proposes a new approach using skill-specific expert learning and merging.
Introduction:
Challenges in current T2I models include modeling spatial relationships and rendering text accurately.
SELMA Methodology:
Skill-specific prompt generation with LLMs, covering a diverse set of skills.
Image generation with the T2I model itself, conditioned on the generated prompts.
Fine-tuning the T2I model with separate LoRA modules, one per skill.
Merging the skill-specific experts into a single joint multi-skill T2I model.
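The merging step above can be sketched as simple parameter averaging of the skill-specific LoRA experts. This is a minimal illustration, not SELMA's actual implementation: the function name `merge_lora_experts` and the flat float lists are stand-ins for the low-rank adapter tensors that a real diffusion model would carry.

```python
# Minimal sketch of merging skill-specific LoRA experts by parameter
# averaging. Names and data shapes are illustrative; in practice each
# expert holds low-rank A/B weight matrices inside a diffusion T2I model.

def merge_lora_experts(experts):
    """Average the parameters of several LoRA experts.

    experts: list of dicts mapping parameter name -> list of floats
             (a stand-in for each adapter's low-rank weights).
    Returns a single merged dict with element-wise averaged parameters.
    """
    n = len(experts)
    merged = {}
    for name in experts[0]:
        params = [e[name] for e in experts]
        merged[name] = [sum(vals) / n for vals in zip(*params)]
    return merged

# Toy example: two skill experts (e.g., spatial relations, text rendering)
spatial_expert = {"lora_A": [0.2, 0.4], "lora_B": [1.0, 0.0]}
text_expert = {"lora_A": [0.6, 0.0], "lora_B": [0.0, 2.0]}

merged = merge_lora_experts([spatial_expert, text_expert])
print(merged)  # {'lora_A': [0.4, 0.2], 'lora_B': [0.5, 1.0]}
```

Averaging is only one possible merging strategy; the key design point is that each expert is trained independently on its own auto-generated skill data, so merging avoids the interference that joint multi-skill fine-tuning can cause.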
Results:
SELMA significantly improves semantic alignment and text faithfulness in state-of-the-art T2I models.
Fine-tuning on auto-generated data achieves performance comparable to fine-tuning on ground-truth data.
Related Work:
Various methods have been proposed to improve text-to-image generation, focusing on supervised fine-tuning or aligning models with human preferences.