Core Concepts
A unified framework that jointly leverages visual and verbal references to improve target perception and discrimination in natural language tracking.
Abstract
The paper presents a novel framework, termed QueryNLT, for natural-language-guided object tracking. The key contributions are:
Prompt Modulation Module:
Exploits the complementarity between dynamic historical information and the language description to generate accurate and context-aware visual and verbal cues for the target.
The language prompt is re-weighted based on the motion cues from the template memory to align with the current scene.
The appearance prompt is generated by filtering out background features in the template using the category or appearance description in the language (a sketch of this modulation follows below).
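A minimal PyTorch-style sketch of how such prompt modulation could be realized with cross-attention. The class name PromptModulation, the dimensions, the two attention blocks, and the sigmoid gate are illustrative assumptions, not the paper's actual implementation.

```python
import torch
import torch.nn as nn

class PromptModulation(nn.Module):
    """Hypothetical sketch: re-weight language tokens with template-memory
    cues, and filter template features with the language description."""

    def __init__(self, dim: int = 256, heads: int = 8):
        super().__init__()
        # Cross-attention: language tokens attend to historical template features.
        self.lang_attn = nn.MultiheadAttention(dim, heads, batch_first=True)
        # Cross-attention: template features attend to language tokens.
        self.vis_attn = nn.MultiheadAttention(dim, heads, batch_first=True)
        # Soft gate used to suppress template regions not matching the description.
        self.gate = nn.Sequential(nn.Linear(dim, 1), nn.Sigmoid())

    def forward(self, lang_tokens, template_memory):
        # lang_tokens:     (B, L, C) embedded language description
        # template_memory: (B, T, C) features from historical templates
        # Verbal prompt: language re-weighted by motion/appearance history.
        verbal, _ = self.lang_attn(lang_tokens, template_memory, template_memory)
        # Visual prompt: background features filtered by the description.
        attended, _ = self.vis_attn(template_memory, lang_tokens, lang_tokens)
        visual = template_memory * self.gate(attended)
        return verbal, visual
```

In the paper's terms, the two outputs would correspond to the context-aware verbal and visual cues handed to the target decoding module.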
Target Decoding Module:
Treats language-based matching and appearance-based matching as a unified instance retrieval problem.
Comprises a multi-modal query generator that aggregates visual and verbal cues into a holistic object vector, and a query-based target locator that establishes the correspondence between the query vector and the search image.
Directly predicts the target location in an end-to-end manner, ensuring spatio-temporal consistency (see the sketch after this list).
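A companion sketch, under the same assumptions, of how the target decoding module could look: a learnable seed query aggregates the visual and verbal prompts into one holistic object vector, which is then matched against search-image features. The seed query, the single-layer attention, and the box head are illustrative choices, not confirmed details of QueryNLT.

```python
import torch
import torch.nn as nn

class TargetDecoder(nn.Module):
    """Hypothetical sketch: fuse visual/verbal prompts into one target query,
    then locate that query in the search-image features via cross-attention."""

    def __init__(self, dim: int = 256, heads: int = 8):
        super().__init__()
        # Learnable seed that aggregates both modalities into one object query.
        self.query_seed = nn.Parameter(torch.randn(1, 1, dim))
        self.query_gen = nn.MultiheadAttention(dim, heads, batch_first=True)
        # Query-based locator: the object query attends to the search image.
        self.locator = nn.MultiheadAttention(dim, heads, batch_first=True)
        # Regress a normalized box (cx, cy, w, h) from the decoded query.
        self.box_head = nn.Sequential(
            nn.Linear(dim, dim), nn.ReLU(), nn.Linear(dim, 4), nn.Sigmoid()
        )

    def forward(self, verbal, visual, search_feats):
        # verbal: (B, L, C), visual: (B, T, C), search_feats: (B, HW, C)
        prompts = torch.cat([verbal, visual], dim=1)
        seed = self.query_seed.expand(search_feats.size(0), -1, -1)
        # Multi-modal query generation: aggregate cues into a holistic vector.
        query, _ = self.query_gen(seed, prompts, prompts)
        # Instance retrieval: match the query against the search image.
        decoded, _ = self.locator(query, search_feats, search_feats)
        return self.box_head(decoded.squeeze(1))  # (B, 4) box prediction
```

Treating both language-based and appearance-based matching as retrieval of a single query vector is what lets the framework predict the target location in one end-to-end pass.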
The proposed framework is extensively evaluated on three natural language tracking datasets (TNL2K, OTB-Lang, LaSOT) and a visual grounding dataset (RefCOCOg). The results demonstrate the effectiveness of the multi-modal approach and its superiority over state-of-the-art trackers.
Stats
No specific numerical results are reproduced here; the focus is on the proposed framework and its evaluation on the benchmark datasets.