The paper addresses a limitation of mainstream open-vocabulary object detection (OVD) models: they prioritize coarse-grained category detection and therefore struggle to detect objects described by fine-grained attributes. To address this, the authors propose a three-step approach called HA-FGOVD:
Attribute Word Extraction: A large language model (LLM) identifies the attribute words in the input text via zero-shot prompting.
Attribute Feature Extraction: The text encoder of the OVD model is modified to extract both global text features and attribute-specific features by strategically adjusting the token attention masks.
Attribute Feature Enhancement: The global text features and attribute-specific features are fused through an explicit linear composition, in which hand-crafted or learned scalar weights reweight the two vectors. The resulting attribute-highlighted feature is then used for the object detection task.
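The last two steps can be illustrated with a minimal Python sketch. It is not the authors' implementation: the function names, the whitespace tokenization, the masked mean pooling that stands in for the OVD text encoder, the final L2 normalization, and the default weights are all assumptions made for illustration, and the attribute words are supplied by hand rather than by an LLM.

```python
import numpy as np


def attribute_attention_mask(tokens, attribute_words):
    """Return a 0/1 mask that keeps only the tokens listed as attribute words.

    Hypothetical helper: a real text encoder works on subword tokens,
    whereas this sketch matches whole, whitespace-separated words.
    """
    attrs = {w.lower() for w in attribute_words}
    return np.array([1 if t.lower() in attrs else 0 for t in tokens], dtype=np.int64)


def masked_mean_pool(token_embeddings, mask):
    """Average the token embeddings selected by the mask.

    Toy stand-in for running the text encoder with an adjusted attention mask.
    """
    weights = mask.astype(token_embeddings.dtype)[:, None]
    total = (token_embeddings * weights).sum(axis=0)
    count = max(float(weights.sum()), 1.0)  # avoid dividing by zero
    return total / count


def highlight_attributes(global_feat, attr_feat, w_global=1.0, w_attr=0.5):
    """Linearly compose the two features; the weights here are illustrative."""
    fused = w_global * global_feat + w_attr * attr_feat
    # L2 normalization is an assumption of this sketch, not stated in the summary.
    return fused / (np.linalg.norm(fused) + 1e-12)


if __name__ == "__main__":
    tokens = "a dark brown wooden chair".split()
    attribute_words = ["dark", "brown", "wooden"]  # assumed LLM output

    rng = np.random.default_rng(0)
    token_embeddings = rng.normal(size=(len(tokens), 8))  # placeholder embeddings

    full_mask = np.ones(len(tokens), dtype=np.int64)
    attr_mask = attribute_attention_mask(tokens, attribute_words)

    global_feat = masked_mean_pool(token_embeddings, full_mask)
    attr_feat = masked_mean_pool(token_embeddings, attr_mask)
    fused = highlight_attributes(global_feat, attr_feat)
    print(attr_mask, fused.shape)
```

Keeping the fusion as an explicit weighted sum means the only quantities to tune are the two scalars, which is what makes it plausible to reuse them across different detectors.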
The authors show that the weight scalars of the linear composition can be transferred between different OVD models, which they present as evidence of the approach's universality. Experiments on the FG-OVD dataset show that HA-FGOVD significantly improves the fine-grained attribute-level detection performance of various mainstream OVD models, achieving new state-of-the-art results.
Key insights extracted from
by Yuqi Ma, Men... at arxiv.org 09-25-2024
https://arxiv.org/pdf/2409.16136.pdf