Key Concepts
QueryNLT is a unified framework that uses both visual and verbal references to improve target perception and discrimination in natural language tracking.
Summary
The paper presents QueryNLT, a novel framework for natural language-guided object tracking. Its key contributions are:
- Prompt Modulation Module:
  - Exploits the complementarity between dynamic historical information and the language description to generate accurate, context-aware visual and verbal cues for the target.
  - Re-weights the language prompt using motion cues from the template memory so that it aligns with the current scene.
  - Generates the appearance prompt by filtering background features out of the template, guided by the category or appearance description in the language input.
- Target Decoding Module:
  - Treats language-based matching and appearance-based matching as a unified instance retrieval problem.
  - Comprises a multi-modal query generator, which aggregates visual and verbal cues into a holistic object vector, and a query-based target locator, which establishes the correspondence between the query vector and the search image.
  - Directly predicts the target location in an end-to-end manner, ensuring spatio-temporal consistency.
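The data flow through the two modules above can be illustrated with a toy pipeline. This is a minimal pure-Python sketch, not the authors' implementation: QueryNLT relies on learned networks, whereas this example substitutes dot-product similarity, softmax weighting, and plain averaging. The function names (`modulate_language`, `filter_template`, `build_query`, `locate`), the threshold value, and the 2-D feature vectors are all hypothetical and chosen only for illustration.

```python
import math

def dot(a, b):
    return sum(x * y for x, y in zip(a, b))

def softmax(scores):
    m = max(scores)  # subtract the max for numerical stability
    exps = [math.exp(s - m) for s in scores]
    total = sum(exps)
    return [e / total for e in exps]

def mean(vectors):
    n = len(vectors)
    return [sum(col) / n for col in zip(*vectors)]

def modulate_language(word_feats, template_memory):
    """Re-weight word features by agreement with the pooled template memory.

    Assumption: pooled memory stands in for the paper's motion cues.
    """
    context = mean(template_memory)
    weights = softmax([dot(w, context) for w in word_feats])
    dim = len(word_feats[0])
    prompt = [sum(wt * w[i] for wt, w in zip(weights, word_feats))
              for i in range(dim)]
    return prompt, weights

def filter_template(template_feats, word_feats, threshold=0.5):
    """Keep template features that agree with the pooled language description."""
    sentence = mean(word_feats)
    kept = [t for t in template_feats if dot(t, sentence) > threshold]
    return kept or template_feats  # fall back to all features if none pass

def build_query(language_prompt, appearance_feats):
    """Aggregate verbal and visual cues into one object vector (plain average)."""
    return mean([language_prompt, mean(appearance_feats)])

def locate(query, search_feats):
    """Score every search location against the query; return the best index."""
    scores = softmax([dot(query, s) for s in search_feats])
    best = max(range(len(scores)), key=scores.__getitem__)
    return best, scores

# Toy features: dimension 0 is "target-like", dimension 1 is "background-like".
word_feats = [[1.0, 0.0], [0.5, 0.5]]
template_memory = [[0.9, 0.1], [0.2, 0.8], [1.0, 0.0]]
search_feats = [[0.1, 0.9], [0.8, 0.2], [0.0, 1.0]]

language_prompt, word_weights = modulate_language(word_feats, template_memory)
appearance_feats = filter_template(template_memory, word_feats)
query = build_query(language_prompt, appearance_feats)
best_index, location_scores = locate(query, search_feats)
print(best_index, location_scores)
```

In this sketch a single query vector drives both language-based and appearance-based matching, which mirrors the unified retrieval view described above. A real system would replace each helper with learned attention layers and regress a bounding box rather than pick a feature index.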
The proposed framework is evaluated on three natural language tracking datasets (TNL2K, OTB-Lang, LaSOT) and one visual grounding dataset (RefCOCOg). The results demonstrate the effectiveness of the multi-modal approach and show that the proposed method outperforms state-of-the-art trackers.
Statistics
This summary does not reproduce specific numerical results. Its focus is on the proposed framework and its evaluation on benchmark datasets.