
Subject-driven Text-to-Image Generation via Preference-based Reinforcement Learning (RPO)


Core Concepts
This paper introduces Reward Preference Optimization (RPO), a novel approach to subject-driven text-to-image generation that leverages a λ-Harmonic reward function and preference-based reinforcement learning to fine-tune diffusion models efficiently. The fine-tuned models achieve state-of-the-art results, generating images faithful to both the reference images and the textual prompts.
Summary
  • Bibliographic Information: Miao, Y., Loh, W., Kothawade, S., Poupart, P., Rashwan, A., & Li, Y. (2024). Subject-driven Text-to-Image Generation via Preference-based Reinforcement Learning. Advances in Neural Information Processing Systems, 36.

  • Research Objective: This paper aims to address the limitations of existing text-to-image generation models in accurately portraying specific subjects from reference images while adhering to textual prompts. The authors propose a novel method, Reward Preference Optimization (RPO), to improve the fidelity of generated images to both reference images and textual descriptions.

  • Methodology: RPO leverages a novel λ-Harmonic reward function that combines image-to-image and text-to-image alignment scores to guide training. This reward enables early stopping, which prevents overfitting to the reference images and accelerates training. Preference labels derived from the reward via the Bradley-Terry model then guide a preference-based reinforcement learning algorithm that fine-tunes a pre-trained diffusion model (see the sketch after this summary).

  • Key Findings: RPO demonstrates superior performance compared to existing state-of-the-art methods on the DreamBench dataset, achieving a CLIP-I score of 0.833 and a CLIP-T score of 0.314. The ablation study highlights the importance of both the λ-Harmonic reward function and the preference loss in achieving these results. The λ-Harmonic reward function effectively guides the model towards generating images faithful to both reference images and textual prompts, while the preference loss acts as a regularizer, preventing overfitting to the reference images.

  • Main Conclusions: RPO presents a more efficient and effective approach for subject-driven text-to-image generation compared to existing methods. The proposed λ-Harmonic reward function and the use of preference-based reinforcement learning contribute significantly to its superior performance in generating high-fidelity images that accurately reflect both the subject and the textual description.

  • Significance: This research significantly contributes to the field of text-to-image generation by introducing a novel reward function and a more efficient training approach. RPO's ability to generate high-fidelity images faithful to both reference images and textual prompts has significant implications for various applications, including content creation, image editing, and design.

  • Limitations and Future Research: While RPO demonstrates promising results, the authors acknowledge limitations regarding the sensitivity of the λ-Harmonic reward function to the choice of λ value. Future research could explore methods for automatically determining the optimal λ value or investigate alternative reward functions less sensitive to hyperparameter tuning. Additionally, exploring the application of RPO to other text-to-image generation tasks beyond subject-driven generation could further validate its effectiveness and broader applicability.
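
As referenced in the methodology bullet, the following is a minimal sketch of how the reward and the preference labels could fit together. It assumes the λ-Harmonic reward is a λ-weighted harmonic mean of an image-alignment score and a text-alignment score, and that pairwise preference labels follow the Bradley-Terry model; the function names, score scales, and numeric values are illustrative assumptions rather than the authors' implementation.

```python
import numpy as np

def lambda_harmonic_reward(r_image: float, r_text: float, lam: float = 0.5) -> float:
    """Lambda-weighted harmonic mean of image- and text-alignment scores.

    Assumes both scores are strictly positive (e.g. cosine similarities
    rescaled into (0, 1]); lam trades off subject fidelity against
    prompt fidelity.
    """
    return 1.0 / (lam / r_image + (1.0 - lam) / r_text)

def bradley_terry_preference(reward_a: float, reward_b: float) -> float:
    """P(A preferred over B) = sigmoid(r_A - r_B) under the Bradley-Terry model."""
    return 1.0 / (1.0 + np.exp(-(reward_a - reward_b)))

# Illustrative usage: score two generated samples against hypothetical
# alignment values and derive a preference label for the fine-tuning loss.
r_a = lambda_harmonic_reward(r_image=0.82, r_text=0.30, lam=0.3)
r_b = lambda_harmonic_reward(r_image=0.65, r_text=0.33, lam=0.3)
prob_a = bradley_terry_preference(r_a, r_b)
label = 1.0 if prob_a > 0.5 else 0.0  # preferred sample gets the positive label
```

In this reading, λ controls the trade-off between subject fidelity and prompt fidelity, and the same reward evaluated on held-out prompts could serve as the validation signal for the early stopping mentioned above.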

Stats
  • RPO achieves a state-of-the-art CLIP-I score of 0.833 and a CLIP-T score of 0.314.
  • RPO requires only 3% of the negative samples needed by DreamBooth.
  • RPO requires fewer gradient steps than DreamBooth.
  • Fine-tuning Stable Diffusion with RPO takes about 5 to 20 minutes on a single Google Cloud Platform TPUv4-8 (32 GB).
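
The CLIP-I and CLIP-T figures above are CLIP-embedding similarities: CLIP-I averages the cosine similarity between a generated image and the reference images, and CLIP-T measures the cosine similarity between a generated image and its prompt. The sketch below is a hedged approximation using the Hugging Face openai/clip-vit-base-patch32 checkpoint; the checkpoint choice and helper names are assumptions, not the paper's evaluation code.

```python
import torch
from PIL import Image
from transformers import CLIPModel, CLIPProcessor

model = CLIPModel.from_pretrained("openai/clip-vit-base-patch32")
processor = CLIPProcessor.from_pretrained("openai/clip-vit-base-patch32")

def _image_embed(image: Image.Image) -> torch.Tensor:
    inputs = processor(images=image, return_tensors="pt")
    feats = model.get_image_features(**inputs)
    return feats / feats.norm(dim=-1, keepdim=True)

def _text_embed(prompt: str) -> torch.Tensor:
    inputs = processor(text=[prompt], return_tensors="pt", padding=True)
    feats = model.get_text_features(**inputs)
    return feats / feats.norm(dim=-1, keepdim=True)

@torch.no_grad()
def clip_i(generated: Image.Image, references: list[Image.Image]) -> float:
    """Mean cosine similarity between the generated image and each reference image."""
    g = _image_embed(generated)
    sims = [float((g @ _image_embed(ref).T).item()) for ref in references]
    return sum(sims) / len(sims)

@torch.no_grad()
def clip_t(generated: Image.Image, prompt: str) -> float:
    """Cosine similarity between the generated image and the text prompt."""
    return float((_image_embed(generated) @ _text_embed(prompt).T).item())
```

The reported benchmark numbers would then be averages of these per-image scores over the DreamBench subjects and prompts.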
Quotes
"In this paper, we propose a λ-Harmonic reward function that enables early stopping and accelerates training." "Our method, Reward Preference Optimization (RPO), only requires a few input reference images and the finetuned diffusion model can generate images that preserve the identity of a specific subject while aligning well with textual prompts." "Empirically, λ-Harmonic proves to be a reliable approach for model selection in subject-driven generation tasks." "Based on preference labels and early stopping validation from the λ-Harmonic reward function, our algorithm achieves a state-of-the-art CLIP-I score of 0.833 and a CLIP-T score of 0.314 on DreamBench."
