Core Concepts
Fine-tuning the pre-trained text encoder can significantly enhance the performance of text-to-image diffusion models, leading to better image quality and text-image alignment, without introducing additional computational or storage overhead.
Summary
The paper presents TextCraftor, an approach that fine-tunes the pre-trained text encoder in text-to-image diffusion models to improve their image quality and text-image alignment.
Key highlights:
- Existing text-to-image diffusion models, such as Stable Diffusion, often struggle to generate images that align well with the provided text prompts, and require extensive prompt engineering to achieve satisfactory results.
- Prior studies have explored replacing the CLIP text encoder used in Stable Diffusion with larger language models, but this incurs significant computational and storage overhead.
- The authors propose TextCraftor, an end-to-end fine-tuning technique that enhances the pre-trained text encoder using reward functions, without requiring paired text-image datasets.
- TextCraftor leverages public reward models, such as image aesthetics predictors and text-image alignment assessment models, to fine-tune the text encoder in a differentiable manner (see the sketch after this list).
- Comprehensive evaluations on public benchmarks and human assessments demonstrate that TextCraftor significantly outperforms pre-trained text-to-image models, prompt engineering, and reinforcement learning-based approaches.
- Embeddings from the fine-tuned text encoder can be interpolated with those from the original encoder to achieve more diverse and controllable style generation (also sketched below).
- TextCraftor is orthogonal to UNet fine-tuning, and the two can be combined to further improve generative quality.
Statistics
The Stable Diffusion v1.5 model is used as the baseline.
The ViT-L text encoder from Stable Diffusion is fine-tuned using TextCraftor.
Training is performed on the OpenPrompt dataset with over 10M high-quality prompts.
Quotes
"We demonstrate that for a well-trained text-to-image diffusion model, fine-tuning text encoder is a buried gem, and can lead to significant improvements in image quality and text-image alignment."
"Compared with using larger text encoders, e.g., SDXL, TextCraftor does not introduce extra computation and storage overhead. Compared with prompt engineering, TextCraftor reduces the risks of generating irrelevant content."