Core Concepts
The core message of this paper is that the misalignment between text prompts and generated images in diffusion models stems from insufficient attention to certain text tokens, and that it can be addressed by an image-to-text concept matching mechanism together with an attribute concentration module.
Abstract
The paper proposes a novel method called CoMat to enhance text-to-image diffusion models. The authors observe that the misalignment between text prompts and generated images is caused by the diffusion model's insufficient utilization of text condition information, leading to certain tokens being overlooked during generation.
To address this issue, the authors introduce two key components:
Concept Matching: The authors leverage a pre-trained image captioning model to measure the alignment between the generated image and the input text prompt. This provides guidance to the diffusion model, forcing it to revisit and attend to the previously ignored text tokens.
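The idea can be illustrated with a minimal sketch: if a captioning model assigns a probability to each prompt token given the generated image, overlooked concepts receive low probability and dominate the negative log-likelihood, which is the signal fed back to the diffusion model. The function name and the toy probabilities below are illustrative assumptions, not the paper's exact formulation.

```python
import math

def concept_matching_loss(token_probs):
    # token_probs: a (hypothetical) captioner's probability for each prompt
    # token conditioned on the generated image. Tokens the image realizes
    # well score near 1; an ignored concept scores low and dominates the
    # average negative log-likelihood, pushing the model to attend to it.
    return -sum(math.log(p) for p in token_probs) / len(token_probs)

# Example: the captioner recognizes "bench" but not the attribute "blue".
probs = [0.9, 0.05, 0.8]  # e.g. "a", "blue", "bench"
loss = concept_matching_loss(probs)
```

Minimizing this loss through the frozen captioner is what "forces the diffusion model to revisit ignored tokens": improving the probability of the worst-scoring token reduces the loss the most.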
Attribute Concentration: To further improve attribute binding, the authors introduce an attribute concentration module. This module enforces the attention of both the entity tokens and their attributes to focus on the same region in the generated image.
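A minimal sketch of this objective: given cross-attention maps for an entity token and its attribute token, plus a mask for the entity's region, penalize attention mass that falls outside the mask for either token. The flat-list representation and the specific loss form below are simplifying assumptions for illustration.

```python
def attention_in_mask(attn, mask):
    # Fraction of a token's attention mass that lands inside the region mask.
    inside = sum(a for a, m in zip(attn, mask) if m)
    return inside / sum(attn)

def attribute_concentration_loss(entity_attn, attr_attn, mask):
    # Both the entity token and its attribute token should concentrate
    # their attention on the same entity region; the loss is the total
    # attention mass either token places outside that region.
    return 2.0 - attention_in_mask(entity_attn, mask) - attention_in_mask(attr_attn, mask)

# Entity attention sits inside the mask; attribute attention has drifted out.
mask = [1, 1, 0, 0]
entity_attn = [0.5, 0.5, 0.0, 0.0]
attr_attn = [0.1, 0.1, 0.4, 0.4]
loss = attribute_concentration_loss(entity_attn, attr_attn, mask)
```

Driving this loss toward zero binds the attribute to its entity by making both tokens attend to the same pixels.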
Additionally, the authors incorporate a fidelity preservation module to prevent the diffusion model from overfitting to the concept matching and attribute concentration objectives, which could lead to a deterioration of its original generation capability.
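One simple way to convey the intent of such a module is a distillation-style penalty: keep the fine-tuned model's noise predictions close to those of a frozen copy of the original model. This is an illustrative stand-in under that assumption, not necessarily the paper's exact mechanism.

```python
def fidelity_preservation_loss(eps_online, eps_frozen):
    # MSE between the fine-tuned model's noise prediction and that of a
    # frozen copy of the pretrained model, discouraging the alignment
    # objectives from degrading the original generation capability.
    return sum((a - b) ** 2 for a, b in zip(eps_online, eps_frozen)) / len(eps_online)

# If fine-tuning has not drifted, the penalty is zero.
penalty = fidelity_preservation_loss([0.2, -0.1, 0.3], [0.2, -0.1, 0.3])
```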
The authors evaluate their method on two benchmarks, T2I-CompBench and TIFA, and demonstrate significant improvements over the baseline diffusion models in terms of text-image alignment, attribute binding, and complex reasoning. Qualitative results also show that CoMat-SDXL generates images that are better aligned with the input prompts compared to other state-of-the-art models.
Stats
The misalignment issue is traced to insufficient attention activation for certain text tokens during the diffusion process.
The overall distribution of text token activation remains at a low level during generation, indicating incomplete utilization of text condition information.
Quotes
"The root reason behind the misalignment has not been extensively investigated. We observe that the misalignment is caused by inadequate token attention activation."
"We further attribute this phenomenon to the diffusion model's insufficient condition utilization, which is caused by its training paradigm."