CLIP-VQDiffusion is proposed for language-free training in text-to-image generation, outperforming state-of-the-art methods on the FFHQ dataset.
SELMA introduces a paradigm for improving the faithfulness of text-to-image models: fine-tuning on auto-generated, multi-skill datasets, with skill-specific expert learning followed by expert merging.
InstructCV introduces a unified language interface for computer vision tasks, leveraging text-to-image generative models to enhance generalization capabilities across diverse datasets and user instructions.
A two-stage method is proposed that combines controllability with high quality in image generation by leveraging pre-trained models and diffusion models; strong results are reported.