Core Concepts
InstructDET aims to diversify referring object detection instructions by leveraging foundation models to generate human-like expressions, improving object detection performance.
Summary
InstructDET introduces a data-centric method for referring object detection (ROD), which localizes target objects based on user instructions. The method leverages foundation models to produce human-like instructions that cover common user intentions related to object detection. The resulting dataset, InDET, is built from existing REC datasets and object detection datasets, so any image with object bounding boxes can be incorporated. Trained on InDET, a conventional ROD model surpasses existing methods on standard REC datasets and the InDET test set. InstructDET points to a promising direction in which ROD is diversified to execute common object detection instructions effectively.
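The paper's actual pipelines are not reproduced here, but the following minimal Python sketch illustrates the general idea: given an image and an object bounding box, prompt a vision-language model once on the object crop (a local view) and once on the full image (a global view), then filter the generated expressions. The `vlm_generate` callable, the prompts, and the filtering thresholds are illustrative assumptions, not the paper's implementation.

```python
from dataclasses import dataclass
from typing import Callable, List, Tuple
from PIL import Image

@dataclass
class ReferringSample:
    image_path: str
    bbox: Tuple[int, int, int, int]  # (left, upper, right, lower) in pixels
    expressions: List[str]

def diversify_instructions(
    image_path: str,
    bbox: Tuple[int, int, int, int],
    vlm_generate: Callable[[Image.Image, str], List[str]],  # hypothetical VLM call, not part of InstructDET
    max_expressions: int = 8,
) -> ReferringSample:
    """Produce multiple human-like referring expressions for one object box."""
    image = Image.open(image_path).convert("RGB")
    crop = image.crop(bbox)  # local view of the target object

    # Local-style prompt: describe the object itself (appearance, category).
    local_prompt = "Describe this object so a person could pick it out."
    local_exprs = vlm_generate(crop, local_prompt)

    # Global-style prompt: describe the object in the context of the full image
    # (relative position, relations to other objects).
    global_prompt = (
        "Describe the object inside the given box relative to the rest of the "
        "scene, e.g. by position or relation to nearby objects."
    )
    global_exprs = vlm_generate(image, global_prompt)

    # Simple filter stand-in: drop duplicates and overly long outputs, cap the count.
    seen, kept = set(), []
    for expr in local_exprs + global_exprs:
        text = expr.strip().lower()
        if text and text not in seen and len(text.split()) <= 20:
            seen.add(text)
            kept.append(expr.strip())
        if len(kept) >= max_expressions:
            break

    return ReferringSample(image_path, bbox, kept)
```

Each resulting (image, bbox, expressions) triple could then serve as training data for a conventional ROD model, which is how the paper expands instruction diversity without manual annotation.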
Structure:
- Abstract
- Introduction
- Related Works
- InstructDET
- Global Prompt Pipeline
- Local Prompt Pipeline
- Expression Filter
- Multi-Objects Expression Generation
- Dataset Analysis
- Referring Object Detection
- Experiments
- Concluding Remarks
- Acknowledgement
- References
Statistics
"Our InDET dataset contains images from MSCOCO (Lin et al., 2014), Flicker (Plummer et al., 2015), and Objects365 (Shao et al., 2019)."
"There are 120.6K images with 908.4K referring object sets in total."
"The average instruction length is 6.2 words and the vocabulary size is 63k words."
Quotes
"Our InstructDET method can automatically expand training data by using in-the-wild images with object bbxs, which improves our model generalizations towards practical usage."
"By leveraging our InDET, the ROD model becomes more practically applicable."