SELMA: Improving Text-to-Image Models with Skill-Specific Expert Learning and Merging
Core Concepts
SELMA introduces a novel paradigm to enhance the faithfulness of Text-to-Image models by fine-tuning on auto-generated, multi-skill datasets with skill-specific expert learning and merging.
Abstract:
Recent text-to-image (T2I) models often fail to generate images that faithfully reflect the semantics of their text prompts.
SELMA proposes a new approach that learns skill-specific experts on auto-generated data and merges them into a single model.
Introduction:
Current T2I models struggle with skills such as rendering correct spatial relationships and legible text.
SELMA Methodology:
Skill-specific prompt generation with LLMs, covering a diverse set of skills (see the prompt-generation sketch after this list).
Image generation with the T2I model itself, conditioned on the generated prompts.
Fine-tuning the T2I model with separate LoRA modules, one per skill.
Merging the skill-specific LoRA experts to build a joint multi-skill T2I model (see the merging sketch after this list).
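The prompt-generation step can be sketched as follows. This is a minimal illustration assuming an OpenAI-style chat API; the SKILLS table, the model name, and the instruction wording are hypothetical placeholders, not SELMA's actual prompts or skill set.

```python
# Sketch: generating skill-specific T2I prompts with an LLM.
# Assumes an OpenAI-style chat API; skill descriptions are illustrative.
from openai import OpenAI

client = OpenAI()

SKILLS = {
    "spatial": "prompts that specify spatial relations between objects",
    "text_rendering": "prompts that ask for specific text written in the image",
}

def generate_prompts(skill: str, n: int = 5) -> list[str]:
    # Ask the LLM for n diverse prompts that exercise one target skill.
    instruction = (
        f"Write {n} diverse text-to-image prompts testing {SKILLS[skill]}. "
        "Return one prompt per line."
    )
    response = client.chat.completions.create(
        model="gpt-4",
        messages=[{"role": "user", "content": instruction}],
    )
    return response.choices[0].message.content.strip().splitlines()
```

The generated prompts are then fed back to the T2I model to produce the skill-specific training images.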
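For the expert-merging step, a minimal sketch is given below, assuming uniform averaging of the per-expert LoRA weight updates (one common LoRA-merging strategy); the LoRAExpert class and merge_experts function are hypothetical names, not SELMA's actual implementation.

```python
# Sketch: merging skill-specific LoRA experts for one linear layer.
# Assumes each expert stores low-rank factors A (r x d_in) and B (d_out x r).
import torch

class LoRAExpert:
    """One skill-specific LoRA module for a single linear layer."""
    def __init__(self, d_in: int, d_out: int, rank: int = 4):
        self.A = torch.randn(rank, d_in) * 0.01  # down-projection
        self.B = torch.zeros(d_out, rank)        # up-projection (zero-init)

    def delta(self) -> torch.Tensor:
        # The weight update this expert contributes: delta_W = B @ A
        return self.B @ self.A

def merge_experts(experts: list[LoRAExpert]) -> torch.Tensor:
    # Average the per-expert weight updates so the merged model
    # covers all skills without joint retraining (assumed uniform weights).
    deltas = [e.delta() for e in experts]
    return torch.stack(deltas).mean(dim=0)

# Usage: merge three skill experts (e.g., spatial, counting, text rendering)
experts = [LoRAExpert(d_in=768, d_out=768) for _ in range(3)]
merged_delta = merge_experts(experts)

# Apply the merged update to a frozen base weight W of the T2I model.
W = torch.randn(768, 768)
W_merged = W + merged_delta
```

Merging at the weight level keeps inference cost identical to the base model, since the averaged update is folded into the frozen weights rather than kept as separate adapters.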
Results:
SELMA significantly improves semantic alignment and text faithfulness in state-of-the-art T2I models.
Fine-tuning on auto-generated data performs comparably to fine-tuning on ground-truth data.
Related Work:
Various methods have been proposed to improve text-to-image generation, focusing on supervised fine-tuning or aligning models with human preferences.
In this work, performance gains are achieved by training a stronger T2I model on images generated by a weaker T2I model; concretely, images produced by a weaker baseline such as SD v2 are used to train a stronger baseline such as SDXL. This follows the concept of "weak-to-strong generalization" previously explored for LLMs (Large Language Models), in which responses produced by a "weak" agent with lower latency or resource requirements (e.g., GPT-2) are used to train a "strong" agent (e.g., GPT-4).