Core Concepts
InstructDET diversifies the instructions used in referring object detection by leveraging foundation models, improving its practical applicability.
Abstract:
InstructDET proposes a data-centric method for referring object detection (ROD) that localizes target objects based on user instructions.
The method leverages diversified instructions to encompass common user intentions related to object detection.
By incorporating emerging vision-language models, InstructDET generates human-like expressions for ROD training.
Introduction:
Referring object detection (ROD) detects target objects based on language references that express user intentions.
Current visual grounding methods offer limited practical utility because existing referring expression comprehension datasets cover only a narrow range of expressions.
Data Generation via Foundation Models:
InstructDET uses foundation models to generate diverse instructions for single and multiple objects in images.
The resulting dataset, InDET, contains images, object bounding boxes, and generalized instructions produced by foundation models.
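To make the generation step concrete, here is a minimal sketch of producing diversified single-object instructions; the `vlm_describe` callable is hypothetical and stands in for whichever foundation model is prompted, and the prompt templates are illustrative rather than the paper's actual prompts:

```python
# Minimal sketch: diversify referring instructions for one boxed object.
# `vlm_describe` is a hypothetical callable standing in for a prompted
# vision-language foundation model; the prompts below are illustrative only.
from dataclasses import dataclass
from typing import Callable, List, Tuple

@dataclass
class ReferringSample:
    image_path: str
    bbox: Tuple[int, int, int, int]  # (x1, y1, x2, y2) in pixels
    instructions: List[str]

def generate_instructions(
    image_path: str,
    bbox: Tuple[int, int, int, int],
    vlm_describe: Callable[[str, Tuple[int, int, int, int], str], str],
) -> ReferringSample:
    """Prompt the model several different ways to diversify expressions."""
    prompt_templates = [
        "Describe the boxed object so a detector could localize it.",
        "Refer to the boxed object by its attributes (color, size, texture).",
        "Refer to the boxed object by its relation to nearby objects.",
    ]
    instructions = [vlm_describe(image_path, bbox, p) for p in prompt_templates]
    return ReferringSample(image_path, bbox, instructions)
```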
Multi-Objects Expression Generation:
Per-object instructions are concatenated and clustered to summarize commonalities shared among multiple objects.
LLaMA is used to generate text descriptions for each cluster center.
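As a rough illustration of this cluster-then-summarize step, the sketch below substitutes TF-IDF features and k-means for the paper's actual embedding and clustering choices, and stubs the LLaMA call behind a hypothetical `llm_summarize` callable:

```python
# Sketch: cluster pooled per-object expressions, then summarize each cluster.
# TF-IDF + k-means are stand-ins for the paper's actual representation;
# `llm_summarize` is a hypothetical wrapper around an LLM such as LLaMA.
from sklearn.feature_extraction.text import TfidfVectorizer
from sklearn.cluster import KMeans

def cluster_expressions(expressions, num_clusters=4):
    """Group concatenated object expressions by textual similarity."""
    features = TfidfVectorizer().fit_transform(expressions)
    kmeans = KMeans(n_clusters=num_clusters, n_init=10, random_state=0)
    labels = kmeans.fit_predict(features)
    clusters = {}
    for expr, label in zip(expressions, labels):
        clusters.setdefault(label, []).append(expr)
    return clusters

def summarize_clusters(clusters, llm_summarize):
    """Ask the LLM to describe what each cluster's expressions share."""
    return {
        label: llm_summarize(
            "Summarize what these referring expressions have in common: "
            + "; ".join(members)
        )
        for label, members in clusters.items()
    }
```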
Dataset Analysis:
InDET dataset contains 120.6K images with 908.4K referring object sets and 3.6M instructions.
Instructions are divided into six groups according to how strongly they emphasize object category, attributes, and inter-object relations.
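The precise grouping rule is not reproduced here; as a purely illustrative heuristic (not the paper's procedure), one could tag which aspects an instruction emphasizes with keyword lookups like this:

```python
# Illustrative heuristic only; NOT the paper's grouping procedure.
# Tags which aspects (attribute / relation) an instruction emphasizes.
ATTRIBUTE_WORDS = {"red", "blue", "small", "large", "wooden", "striped"}
RELATION_WORDS = {"left", "right", "next", "behind", "above", "under", "near"}

def emphasis_tags(instruction: str) -> set:
    tokens = set(instruction.lower().split())
    tags = {"category"}  # assume every instruction names some object category
    if tokens & ATTRIBUTE_WORDS:
        tags.add("attribute")
    if tokens & RELATION_WORDS:
        tags.add("relation")
    return tags
```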
Referring Object Detection Experiments:
The DROD model outperforms existing visual grounding methods on the InDET test set by comprehending instruction meanings more effectively.
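Visual grounding results are conventionally reported as precision at an IoU threshold of 0.5; the sketch below computes that standard metric, though the paper's exact evaluation protocol may differ:

```python
# Standard visual-grounding metric: fraction of predictions whose IoU with
# the ground-truth box meets a threshold (commonly 0.5).
def iou(box_a, box_b):
    """Intersection-over-union of two (x1, y1, x2, y2) boxes."""
    x1, y1 = max(box_a[0], box_b[0]), max(box_a[1], box_b[1])
    x2, y2 = min(box_a[2], box_b[2]), min(box_a[3], box_b[3])
    inter = max(0, x2 - x1) * max(0, y2 - y1)
    area_a = (box_a[2] - box_a[0]) * (box_a[3] - box_a[1])
    area_b = (box_b[2] - box_b[0]) * (box_b[3] - box_b[1])
    union = area_a + area_b - inter
    return inter / union if union > 0 else 0.0

def accuracy_at_iou(predictions, ground_truths, threshold=0.5):
    """predictions and ground_truths are parallel lists of boxes."""
    hits = sum(iou(p, g) >= threshold for p, g in zip(predictions, ground_truths))
    return hits / len(ground_truths)
```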
Stats
This paper was published at ICLR 2024.
Our InstructDET model generates diverse instructions, improving practical usage.
The InDET dataset contains 120.6K images and 908.4K referring object sets.