Prioritizing precision over recall in image captions, whether human-annotated or synthetically generated with Large Vision Language Models (LVLMs), leads to better performance when training text-to-image generation models, particularly in compositional capabilities; synthetic captions show trends similar to human annotations in this respect.
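To make the precision/recall distinction concrete, here is a minimal illustrative sketch (not the paper's exact metric) that treats a caption as a set of stated concepts: precision is the fraction of stated concepts that are actually in the image, recall is the fraction of image concepts the caption mentions.

```python
# Illustrative sketch: caption precision vs. recall over concept sets.
def caption_precision_recall(stated: set[str], ground_truth: set[str]) -> tuple[float, float]:
    """Return (precision, recall) of a caption's concept set against ground truth."""
    if not stated or not ground_truth:
        return 0.0, 0.0
    correct = stated & ground_truth
    precision = len(correct) / len(stated)      # how much of what the caption says is true
    recall = len(correct) / len(ground_truth)   # how much of the image the caption covers
    return precision, recall

# A short but fully correct caption (high precision, low recall) versus a longer
# caption with one hallucinated detail (high recall, lower precision).
image_concepts = {"dog", "red ball", "grass", "tree", "fence"}
short_caption = {"dog", "red ball"}                         # precision 1.0, recall 0.4
long_caption = {"dog", "red ball", "grass", "tree", "cat"}  # precision 0.8, recall 0.8
print(caption_precision_recall(short_caption, image_concepts))
print(caption_precision_recall(long_caption, image_concepts))
```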
This research paper introduces RPO, a novel approach for subject-driven text-to-image generation that leverages a λ-Harmonic reward function and preference-based reinforcement learning to efficiently fine-tune diffusion models, achieving state-of-the-art results in generating images faithful to both reference images and textual prompts.
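A minimal sketch of a λ-weighted harmonic-style reward that balances faithfulness to the reference images against faithfulness to the text prompt; the exact functional form, the similarity backbones, and the ranking pipeline are assumptions, with `sim_ref` / `sim_txt` standing in for image-image and image-text similarity scores in [0, 1].

```python
import torch

def lambda_harmonic_reward(sim_ref: torch.Tensor, sim_txt: torch.Tensor,
                           lam: float = 0.5, eps: float = 1e-8) -> torch.Tensor:
    """Weighted harmonic combination: high only when BOTH similarities are high."""
    return 1.0 / (lam / (sim_ref + eps) + (1.0 - lam) / (sim_txt + eps))

def rank_candidates(sims_ref: torch.Tensor, sims_txt: torch.Tensor, lam: float = 0.5):
    """Derive preference orderings over candidate generations from the reward (sketch)."""
    rewards = lambda_harmonic_reward(sims_ref, sims_txt, lam)
    return torch.argsort(rewards, descending=True)  # preferred samples first
```

Rankings like these could supply the preference pairs consumed by preference-based fine-tuning; this is a sketch of the idea, not the authors' exact pipeline.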
ReNO is a novel approach that significantly improves the quality and prompt adherence of one-step text-to-image synthesis models by optimizing the initial latent noise vector based on feedback from multiple human preference reward models.
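The core idea, optimizing the initial noise rather than the model weights, can be sketched as gradient ascent on an aggregated reward; `generator` (noise, prompt → image) and `reward_models` (image, prompt → scalar) are placeholders, and ReNO's actual reward ensemble, regularization, and update rule may differ.

```python
import torch

def optimize_initial_noise(generator, reward_models, prompt: str,
                           shape=(1, 4, 64, 64), steps: int = 50, lr: float = 0.05):
    """Sketch of reward-driven initial-noise optimization for a one-step generator."""
    noise = torch.randn(shape, requires_grad=True)
    opt = torch.optim.Adam([noise], lr=lr)
    for _ in range(steps):
        image = generator(noise, prompt)                          # one-step synthesis
        reward = sum(rm(image, prompt) for rm in reward_models)   # aggregate preference scores
        loss = -reward                                            # ascend on the reward
        opt.zero_grad()
        loss.backward()
        opt.step()
    return noise.detach()
```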
This paper proposes an effective method of inserting adapters into a text-to-image foundation model, enabling it to perform complex downstream tasks while preserving the base model's generalization ability.
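A minimal bottleneck-adapter sketch, assuming the common recipe of freezing the base model and training only small residual modules inserted after existing layers; the paper's actual insertion points and adapter architecture may differ.

```python
import torch
import torch.nn as nn

class Adapter(nn.Module):
    """Bottleneck adapter with a residual connection, initialized to the identity."""
    def __init__(self, dim: int, bottleneck: int = 64):
        super().__init__()
        self.down = nn.Linear(dim, bottleneck)
        self.act = nn.GELU()
        self.up = nn.Linear(bottleneck, dim)
        nn.init.zeros_(self.up.weight)  # start as identity so base behavior is unchanged
        nn.init.zeros_(self.up.bias)

    def forward(self, x: torch.Tensor) -> torch.Tensor:
        return x + self.up(self.act(self.down(x)))

# Usage sketch: freeze the base model and train only the adapter parameters.
# for p in base_model.parameters():
#     p.requires_grad_(False)
# adapter = Adapter(dim=1280)  # hidden size is an assumption
```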
Diffusion-based text-to-image models outperform autoregressive models in compositional generation tasks, suggesting that the inductive bias of next-token prediction alone is insufficient for complex image generation from text.
The R2F (Rare-to-Frequent) framework addresses pretrained diffusion models' difficulty with rare concepts by leveraging the semantic knowledge of Large Language Models (LLMs) to map rare concepts to frequent ones and guide the diffusion process, significantly improving image generation from prompts containing rare or unusual compositions of concepts without any additional training.
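A sketch of the rare-to-frequent idea: ask an LLM for a visually similar but frequent surrogate concept, use the surrogate prompt during the early (coarse-structure) denoising steps, then switch back to the original rare-concept prompt. The schedule, the LLM instruction, and the `denoise_step` / `query_llm` callables are placeholders, not the R2F API.

```python
def rare_to_frequent_sampling(denoise_step, query_llm, rare_prompt: str,
                              latents, num_steps: int = 50, switch_frac: float = 0.4):
    """Alternate between a frequent surrogate prompt (early steps) and the rare prompt."""
    frequent_prompt = query_llm(
        "Rewrite this prompt, replacing rare or unusual concepts with visually similar, "
        f"common ones: {rare_prompt}"
    )
    for t in range(num_steps):
        # Early steps: the frequent surrogate shapes the layout.
        # Late steps: the rare target restores the intended details.
        prompt = frequent_prompt if t < int(switch_frac * num_steps) else rare_prompt
        latents = denoise_step(latents, prompt, t)
    return latents
```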
GROUNDIT, a novel training-free technique, enhances the spatial accuracy of text-to-image generation using Diffusion Transformers by cultivating and transplanting noisy image patches within specified bounding boxes, leading to more precise object placement compared to previous methods.
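The transplanting step can be sketched as denoising a dedicated per-object latent ("cultivation") and copying it into the corresponding bounding-box region of the global latent at each step ("transplanting"); the function name, resizing choice, and schedule below are assumptions, not GROUNDIT's implementation.

```python
import torch

def transplant_patch(global_latents: torch.Tensor, object_latents: torch.Tensor,
                     bbox: tuple[int, int, int, int]) -> torch.Tensor:
    """Paste an object's noisy latent patch into its bounding box (x0, y0, x1, y1)."""
    x0, y0, x1, y1 = bbox
    patch = torch.nn.functional.interpolate(
        object_latents, size=(y1 - y0, x1 - x0), mode="bilinear", align_corners=False
    )
    out = global_latents.clone()
    out[:, :, y0:y1, x0:x1] = patch  # overwrite the box region with the cultivated patch
    return out
```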