Core Concepts
InstructCV introduces a unified language interface for computer vision tasks, leveraging text-to-image generative models to enhance generalization capabilities.
Statistics
Recent work on text-to-image models has achieved impressive performance in image synthesis [1–3].
Models such as DALL·E [2] and Stable Diffusion [8] exemplify this progress and are now used in real-world applications.
To train our model, we pool commonly used computer vision datasets covering a range of tasks, including segmentation, object detection, depth estimation, and classification.
Our pooled multi-modal/multi-task instruction-tuning dataset comprises 180,285 images.
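One way to picture the pooling step is as a merge of per-task records into a single list of (image, instruction, target) samples. The sketch below is illustrative only and is not the authors' pipeline: the `InstructionSample` type, the `TEMPLATES` prompts, the `pool_datasets` helper, and the record keys (`image`, `target`, `category`) are all hypothetical names chosen for this example.

```python
from dataclasses import dataclass
from typing import Dict, List


@dataclass
class InstructionSample:
    """One pooled training sample (hypothetical structure)."""
    image_path: str
    instruction: str
    target_path: str
    task: str


# Hypothetical prompt templates, one per task named in the text.
TEMPLATES: Dict[str, str] = {
    "segmentation": "Segment the {category}.",
    "detection": "Detect the {category}.",
    "depth": "Estimate the depth map of this image.",
    "classification": "Is there a {category} in this image?",
}


def pool_datasets(per_task: Dict[str, List[dict]]) -> List[InstructionSample]:
    """Merge per-task records into one instruction-formatted list."""
    pooled: List[InstructionSample] = []
    for task, records in per_task.items():
        template = TEMPLATES[task]
        for rec in records:
            # Templates without a {category} slot simply ignore the argument.
            instruction = template.format(category=rec.get("category", ""))
            pooled.append(
                InstructionSample(rec["image"], instruction, rec["target"], task)
            )
    return pooled
```

Under these assumptions, every task shares the same sample shape, so a single model can be trained on the pooled list without task-specific heads.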
Inference with InstructCV takes 5 seconds per 256×256 image on a single NVIDIA A100 GPU.