RPO is a novel approach to subject-driven text-to-image generation that leverages a λ-Harmonic reward function and preference-based reinforcement learning to efficiently fine-tune diffusion models, achieving state-of-the-art results in generating images faithful to both the reference images and the textual prompt.
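As a rough illustration, the sketch below computes a λ-weighted harmonic combination of two fidelity scores; this is only one natural reading of the name "λ-Harmonic reward", and the `image_fidelity`/`text_fidelity` inputs are stand-ins for scores such as image-similarity and text-alignment metrics, not RPO's actual implementation.

```python
def lambda_harmonic_reward(image_fidelity: float, text_fidelity: float,
                           lam: float = 0.5, eps: float = 1e-8) -> float:
    """Weighted harmonic mean of two fidelity scores in (0, 1].

    Hypothetical reading of a 'λ-Harmonic' reward: lam weights faithfulness to the
    reference subject, (1 - lam) weights faithfulness to the prompt. The harmonic
    mean keeps the reward low unless BOTH criteria are satisfied.
    """
    return 1.0 / (lam / (image_fidelity + eps) + (1.0 - lam) / (text_fidelity + eps))


# Example: an image that copies the subject well but ignores the prompt scores low.
print(lambda_harmonic_reward(0.9, 0.2, lam=0.5))  # ~0.33, dominated by the weaker score
```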
ReNO is a novel approach that significantly improves the quality and prompt adherence of one-step text-to-image synthesis models by optimizing the initial latent noise vector based on feedback from multiple human preference reward models.
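A schematic of reward-guided optimization of the initial latent, assuming the one-step generator and reward models are differentiable; the `generator` and `reward_models` arguments here are generic placeholders, not ReNO's actual API.

```python
import torch
from typing import Callable, Sequence


def optimize_initial_noise(
    generator: Callable[[torch.Tensor, str], torch.Tensor],                # one-step T2I model: (noise, prompt) -> image
    reward_models: Sequence[Callable[[torch.Tensor, str], torch.Tensor]],  # each returns a scalar preference score
    prompt: str,
    noise_shape: tuple = (1, 4, 64, 64),
    steps: int = 50,
    lr: float = 5e-2,
) -> torch.Tensor:
    """Gradient-ascend on the initial noise so the generated image scores higher
    under a set of differentiable preference reward models."""
    noise = torch.randn(noise_shape, requires_grad=True)
    opt = torch.optim.Adam([noise], lr=lr)
    for _ in range(steps):
        image = generator(noise, prompt)                       # single forward pass of the one-step model
        reward = sum(r(image, prompt) for r in reward_models)  # aggregate the preference scores
        loss = -reward                                         # maximize total reward
        opt.zero_grad()
        loss.backward()
        opt.step()
    return noise.detach()
```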
This paper proposes an effective method that inserts adapters into a text-to-image foundation model, allowing it to perform complex downstream tasks while preserving the base model's generalization ability.
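A minimal bottleneck-adapter sketch in PyTorch, illustrating the general idea of training a small residual module next to a frozen base layer; the paper's specific adapter architecture and insertion points may differ.

```python
import torch
import torch.nn as nn


class BottleneckAdapter(nn.Module):
    """Small residual adapter: down-project, nonlinearity, up-project.

    Zero-initializing the up-projection makes the adapter start as an identity
    mapping, so the frozen foundation model's behavior is preserved at first.
    """
    def __init__(self, dim: int, bottleneck: int = 64):
        super().__init__()
        self.down = nn.Linear(dim, bottleneck)
        self.act = nn.GELU()
        self.up = nn.Linear(bottleneck, dim)
        nn.init.zeros_(self.up.weight)
        nn.init.zeros_(self.up.bias)

    def forward(self, x: torch.Tensor) -> torch.Tensor:
        return x + self.up(self.act(self.down(x)))


class AdaptedBlock(nn.Module):
    """Wraps a frozen base block with a trainable adapter."""
    def __init__(self, base_block: nn.Module, dim: int):
        super().__init__()
        self.base = base_block
        for p in self.base.parameters():
            p.requires_grad_(False)      # keep the foundation model frozen
        self.adapter = BottleneckAdapter(dim)

    def forward(self, x: torch.Tensor) -> torch.Tensor:
        return self.adapter(self.base(x))
```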
Diffusion-based text-to-image models outperform autoregressive models in compositional generation tasks, suggesting that the inductive bias of next-token prediction alone is insufficient for complex image generation from text.
This paper proposes R2F (Rare-to-Frequent), a training-free framework that addresses pre-trained diffusion models' difficulty in generating rare concepts by using a Large Language Model (LLM) to map rare concepts to frequent ones, improving image generation quality without any additional training.
Leveraging the semantic knowledge of Large Language Models (LLMs) to guide the diffusion process significantly improves the ability of text-to-image diffusion models to generate images from prompts containing rare or unusual compositions of concepts.
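One plausible way to instantiate this rare-to-frequent idea is sketched below: an LLM rewrites the rare prompt into a more common surrogate, and the surrogate conditioning is used only for the early, layout-forming denoising steps. The `rewrite_with_llm` call is a hypothetical placeholder and the switching schedule is illustrative, not R2F's exact procedure.

```python
from typing import Callable


def prompt_schedule(
    rare_prompt: str,
    rewrite_with_llm: Callable[[str], str],   # hypothetical LLM call: rare concept -> frequent surrogate
    num_steps: int = 50,
    switch_fraction: float = 0.3,
) -> list[str]:
    """Return the prompt to condition on at each denoising step.

    Early steps (coarse structure) use the frequent surrogate produced by the LLM,
    e.g. "a hairy frog" -> "a hairy animal"; later steps switch back to the
    original rare prompt so its exact semantics are restored.
    """
    frequent_prompt = rewrite_with_llm(rare_prompt)
    switch_step = int(num_steps * switch_fraction)
    return [frequent_prompt if t < switch_step else rare_prompt for t in range(num_steps)]
```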
GROUNDIT, a novel training-free technique, enhances the spatial accuracy of text-to-image generation using Diffusion Transformers by cultivating and transplanting noisy image patches within specified bounding boxes, leading to more precise object placement compared to previous methods.
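The snippet below sketches only the "transplantation" step: copying a separately denoised object patch into its bounding-box region of the main latent, assuming the usual (batch, channels, height, width) latent layout. How GROUNDIT "cultivates" those patches at each denoising step is more involved and not shown.

```python
import torch
import torch.nn.functional as F


def transplant_patch(
    latent: torch.Tensor,        # main noisy latent, shape (B, C, H, W)
    object_patch: torch.Tensor,  # separately denoised latent for one object, shape (B, C, h, w)
    bbox: tuple[float, float, float, float],  # (x0, y0, x1, y1) in [0, 1] normalized coordinates
) -> torch.Tensor:
    """Paste an object's noisy latent patch into its bounding-box region."""
    _, _, H, W = latent.shape
    x0, y0, x1, y1 = bbox
    left, top = int(x0 * W), int(y0 * H)
    right, bottom = int(x1 * W), int(y1 * H)
    # Resize the object patch to the box size, then overwrite that region of the latent.
    resized = F.interpolate(object_patch, size=(bottom - top, right - left),
                            mode="bilinear", align_corners=False)
    out = latent.clone()
    out[:, :, top:bottom, left:right] = resized
    return out
```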
Diff-Instruct++ is a novel method for aligning one-step text-to-image generators with human preferences by leveraging a teacher-student approach, achieving superior image quality and adherence to user prompts.
RealignDiff, a novel two-stage semantic re-alignment method, significantly improves the alignment between generated images and textual descriptions in text-to-image diffusion models by employing coarse-grained caption reward feedback and fine-grained local caption-guided attention modulation.
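A schematic of the coarse-grained stage's reward signal: caption the generated image and score the caption's semantic similarity to the source prompt. The `captioner` and `text_similarity` callables are placeholders (e.g., a BLIP-style captioner and an embedding-based similarity); the paper's exact reward and its fine-grained attention-modulation stage are not shown.

```python
from typing import Callable
import torch


def coarse_caption_reward(
    image: torch.Tensor,
    prompt: str,
    captioner: Callable[[torch.Tensor], str],      # e.g. a BLIP-style image captioner
    text_similarity: Callable[[str, str], float],  # e.g. cosine similarity of text embeddings
) -> float:
    """Reward a generated image by how well its own caption matches the prompt.

    A low score signals that the image drifted from the text description, which
    is the feedback used to re-align the generator in the coarse-grained stage.
    """
    caption = captioner(image)
    return text_similarity(caption, prompt)
```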