Core Concepts
InstructCV introduces a unified language interface for computer vision tasks, leveraging text-to-image generative models to enhance generalization capabilities.
Abstract
Introduction:
Recent advances in generative diffusion models enable text-controlled image synthesis.
Most current approaches rely on task-specific architectures and loss functions, which limits generality across tasks.
InstructCV Framework:
Develops a unified language interface for computer vision tasks.
Multiple tasks are cast as text-to-image generation problems using natural language instructions.
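To make this concrete, below is a minimal sketch of instruction-conditioned inference using the InstructPix2Pix-style pipeline from Hugging Face diffusers. The checkpoint path is a hypothetical placeholder and the instructions are illustrative, not the paper's exact prompt templates.

```python
# Minimal sketch: one text-to-image model, several vision tasks.
# "path/to/instructcv-checkpoint" is a hypothetical placeholder.
import torch
from PIL import Image
from diffusers import StableDiffusionInstructPix2PixPipeline

pipe = StableDiffusionInstructPix2PixPipeline.from_pretrained(
    "path/to/instructcv-checkpoint",  # hypothetical checkpoint
    torch_dtype=torch.float16,
).to("cuda")

image = Image.open("street.jpg").convert("RGB").resize((256, 256))

# Only the natural-language instruction changes between tasks.
instructions = [
    "Segment the car.",
    "Detect the pedestrian.",
    "Estimate the depth of this image.",
]
for i, text in enumerate(instructions):
    out = pipe(prompt=text, image=image, num_inference_steps=50).images[0]
    out.save(f"output_{i}.png")
```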
Training Process:
Utilizes a multi-modal, multi-task dataset for instruction-tuning a pre-trained diffusion model.
Instruction-tuning enhances generalization to unseen data, categories, and user instructions.
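As a rough illustration of how such instruction-tuning data might be assembled, the sketch below pairs an input image with a language instruction and a ground-truth target rendered as an image (e.g., a segmentation mask or depth map). The templates and file paths are hypothetical, not the paper's exact prompts or data layout.

```python
# Sketch: building (input image, instruction, target image) triplets,
# assuming each task's ground truth is already rendered as an RGB
# image (segmentation mask, box overlay, depth map, ...).
import random

TEMPLATES = {  # hypothetical instruction templates
    "segmentation": ["Segment the {category}.", "Mark all pixels of the {category}."],
    "detection": ["Detect the {category}.", "Draw a box around the {category}."],
    "depth": ["Estimate the depth of this image."],
    "classification": ["Is there a {category} in this image?"],
}

def make_triplet(task, image_path, target_path, category=None):
    """Return one training example: input image, instruction, target image."""
    template = random.choice(TEMPLATES[task])
    instruction = template.format(category=category) if category else template
    return {"input": image_path, "instruction": instruction, "target": target_path}

example = make_triplet("segmentation", "coco/000001.jpg",
                       "coco/000001_mask.png", category="dog")
print(example["instruction"])  # e.g., "Segment the dog."
```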
Experiments:
Achieves competitive performance compared with other vision models across a range of tasks.
Demonstrates compelling generalization properties to new datasets and categories.
Limitations and Future Work:
Inference speed lags behind that of specialized models, limiting real-time applications.
Potential improvements include learning from human feedback and supporting more nuanced conditioning.
Stats
Recent work on text-to-image models has achieved impressive performance in image synthesis [1–3].
Models like DALL·E [2] and Stable Diffusion [8] highlight this progress and are now finding use in real-world applications.
To train our model, we pool commonly-used computer vision datasets covering a range of tasks, including segmentation, object detection, depth estimation, and classification.
Our pooled multi-modal/multi-task instruction-tuning dataset comprises 180,285 images.
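A simple way to picture this pooling step is flattening per-task record lists into a single shuffled multi-task training pool, as in the sketch below. The record contents are toy stand-ins, not the actual dataset loaders.

```python
# Sketch: pooling per-task records into one multi-task training pool.
import random

def pooled_dataset(per_task_records):
    """Flatten per-task record lists into one shuffled multi-task pool."""
    pool = [rec for records in per_task_records.values() for rec in records]
    random.shuffle(pool)
    return pool

# Toy stand-ins for real segmentation/detection/depth/classification data.
per_task_records = {
    task: [{"task": task, "image": f"{task}_{i}.jpg"} for i in range(3)]
    for task in ("segmentation", "detection", "depth", "classification")
}
pool = pooled_dataset(per_task_records)
print(len(pool))  # 12 in this toy example; the paper's pool has 180,285 images
```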
InstructCV's inference time on a single NVIDIA A100 GPU is 5 seconds for a 256×256 image.
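For reference, such a latency figure could be measured with a sketch like the one below, reusing the hypothetical pipeline setup from the earlier inference example. Synchronizing the GPU before reading the clock ensures asynchronous CUDA kernels are included in the measurement.

```python
# Sketch: timing a single 256x256 generation end to end.
import time
import torch
from PIL import Image
from diffusers import StableDiffusionInstructPix2PixPipeline

pipe = StableDiffusionInstructPix2PixPipeline.from_pretrained(
    "path/to/instructcv-checkpoint",  # hypothetical checkpoint
    torch_dtype=torch.float16,
).to("cuda")
image = Image.open("street.jpg").convert("RGB").resize((256, 256))

torch.cuda.synchronize()  # flush pending GPU work before starting the clock
start = time.perf_counter()
_ = pipe(prompt="Estimate the depth of this image.",
         image=image, num_inference_steps=50).images[0]
torch.cuda.synchronize()  # wait for all kernels before stopping the clock
print(f"latency: {time.perf_counter() - start:.2f} s")
```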