Core Concepts
InstructDET diversifies referring object detection (ROD) instructions by leveraging foundation models to generate human-like expressions, improving practical usability from a data-centric perspective.
Abstract
Abstract: InstructDET proposes a data-centric method for referring object detection (ROD) that localizes target objects based on user instructions.
Introduction: ROD aims to detect target objects according to language references that represent user intentions; the task is closely related to visual grounding (VG).
Data Extraction:
"Our InDET dataset contains images from MSCOCO, Flickr, and Objects365."
"In our InDET test set, we compare our DROD model to other methods under the evaluation metric of object bounding box (bbox) average precision."
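The bbox average-precision metric mentioned above is built on intersection-over-union (IoU) between predicted and ground-truth boxes. The paper's exact evaluation protocol is not reproduced here; the following is a minimal sketch of the IoU check that underpins it, assuming the common `(x1, y1, x2, y2)` box format and the conventional 0.5 hit threshold used in visual grounding evaluation.

```python
def iou(box_a, box_b):
    """Intersection-over-union of two boxes in (x1, y1, x2, y2) format."""
    # Coordinates of the intersection rectangle.
    x1 = max(box_a[0], box_b[0])
    y1 = max(box_a[1], box_b[1])
    x2 = min(box_a[2], box_b[2])
    y2 = min(box_a[3], box_b[3])
    inter = max(0.0, x2 - x1) * max(0.0, y2 - y1)
    area_a = (box_a[2] - box_a[0]) * (box_a[3] - box_a[1])
    area_b = (box_b[2] - box_b[0]) * (box_b[3] - box_b[1])
    union = area_a + area_b - inter
    return inter / union if union > 0 else 0.0

def is_hit(pred, gt, thresh=0.5):
    """A prediction counts as correct when IoU with the ground truth
    meets the threshold (0.5 is the common VG convention)."""
    return iou(pred, gt) >= thresh
```

Average precision then aggregates these hit/miss decisions over ranked predictions; only the per-box IoU step is shown here.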
Quotations:
"Our InDET dataset improves logic reasoning of ROD models."
"By leveraging our InDET, the ROD model becomes more practically applicable."
Data Analysis:
The InDET dataset contains 120.6K images with 908.4K referring object sets and 3.6M instructions.
It surpasses existing datasets in instruction quantity, richness, and vocabulary breadth.
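As a quick sanity check on the scale claims, the per-image and per-object-set averages follow directly from the headline counts (the counts are from the summary above; the derived averages are back-of-envelope, not figures reported here):

```python
# Headline statistics reported for the InDET dataset.
num_images = 120_600           # 120.6K images
num_object_sets = 908_400      # 908.4K referring object sets
num_instructions = 3_600_000   # 3.6M instructions

# Derived densities: roughly 7.5 object sets per image and
# 4 instructions per referring object set.
sets_per_image = num_object_sets / num_images
instructions_per_set = num_instructions / num_object_sets
print(f"{sets_per_image:.1f} object sets/image, "
      f"{instructions_per_set:.1f} instructions/object set")
```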
Experiments:
The DROD model achieves favorable performance on the InDET test set and on standard VG benchmarks.
It outperforms existing VG methods on the InDET test set, demonstrating improved comprehension of user instructions.
Concluding Remarks:
The InstructDET method leverages foundation models to improve the generalization and logical reasoning of ROD models.