Recent advances in generative diffusion models have transformed text-controlled image synthesis. InstructCV aims to bridge the gap between text-to-image generative models and standard visual recognition tasks through a unified language interface. It casts a range of computer vision tasks as text-to-image generation problems, so that a natural language instruction specifies which task the model performs on an input image. The model is trained on a multi-modal, multi-task dataset and performs competitively with other vision models. In experiments, InstructCV also generalizes well to unseen data, categories, and user instructions.
Key ideas extracted from the paper by Yulu Gan, Sun... (arxiv.org, 03-15-2024)
https://arxiv.org/pdf/2310.00390.pdf