Core Concepts
This article proposes a new task and model for dense video object captioning: detecting, tracking, and captioning object trajectories in videos. The authors design an end-to-end model that can be trained on a mixture of disjoint datasets, which enables zero-shot capabilities and provides a strong initialization for further finetuning.
Abstract
The article proposes a new task called "Dense Video Object Captioning" (Dense VOC), which involves detecting, tracking, and captioning object trajectories in videos. This task unifies spatial and temporal localization in video, while also requiring fine-grained visual understanding that is best described by natural language.
The authors design an end-to-end model for this task, with three main components:
Object proposal generator: This module produces class-agnostic object proposals per frame.
Tracking module: This module assigns unique identities to the object proposals across frames, using a novel end-to-end tracking algorithm.
Captioning module: This module aggregates the tracked object features and generates captions for the object trajectories.
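The three modules above form a per-frame-proposals → track → caption pipeline. The sketch below is a minimal toy illustration of that flow, not the authors' method: it stands in for the learned end-to-end tracker with a simple greedy IoU linker, and for the captioning head with a mean-pooling placeholder. All names (`Track`, `track_proposals`, `caption_track`) and the scalar "features" are hypothetical.

```python
from dataclasses import dataclass, field

@dataclass
class Track:
    track_id: int
    boxes: list = field(default_factory=list)     # one (x1, y1, x2, y2) box per frame
    features: list = field(default_factory=list)  # stand-in scalar feature per frame

def iou(a, b):
    """Intersection-over-union of two (x1, y1, x2, y2) boxes."""
    ix1, iy1 = max(a[0], b[0]), max(a[1], b[1])
    ix2, iy2 = min(a[2], b[2]), min(a[3], b[3])
    inter = max(0, ix2 - ix1) * max(0, iy2 - iy1)
    area = lambda r: (r[2] - r[0]) * (r[3] - r[1])
    union = area(a) + area(b) - inter
    return inter / union if union > 0 else 0.0

def track_proposals(per_frame_proposals, iou_thresh=0.5):
    """Greedily link per-frame (box, feature) proposals into trajectories.

    A proposal joins the unmatched track whose last box overlaps it most
    (above iou_thresh); otherwise it starts a new track. The paper's model
    learns this assignment end-to-end instead.
    """
    tracks, next_id = [], 0
    for frame in per_frame_proposals:
        unmatched = list(tracks)
        for box, feat in frame:
            best, best_iou = None, iou_thresh
            for t in unmatched:
                score = iou(t.boxes[-1], box)
                if score > best_iou:
                    best, best_iou = t, score
            if best is None:
                best = Track(next_id)
                next_id += 1
                tracks.append(best)
            else:
                unmatched.remove(best)
            best.boxes.append(box)
            best.features.append(feat)
    return tracks

def caption_track(track):
    """Placeholder captioner: aggregate track features by mean pooling."""
    pooled = sum(track.features) / len(track.features)
    return f"object {track.track_id} (pooled feature {pooled:.2f})"
```

For example, two overlapping boxes in consecutive frames are linked into one trajectory, while a distant box starts a second one; each resulting track is then captioned from its aggregated features.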
The authors propose a training strategy based on a mixture of disjoint tasks and datasets, letting them leverage diverse, large-scale sources that each supervise different parts of the model. This enables zero-shot capabilities and serves as a strong initialization for further finetuning.
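One way to picture this mixed training is a loop that samples a dataset each step and backpropagates only the losses that dataset can supervise. The sketch below is an illustrative assumption, not the paper's implementation: the dataset names and the mapping from datasets to loss heads (`DATASET_LOSSES`) are hypothetical.

```python
import random

# Hypothetical mapping: each disjoint dataset supervises a subset of loss heads.
DATASET_LOSSES = {
    "detection_data": ["proposal_loss"],                 # boxes only
    "tracking_data": ["proposal_loss", "track_loss"],    # boxes + identities
    "caption_data": ["proposal_loss", "caption_loss"],   # boxes + text
}

def training_step(dataset_name, batch, loss_fns):
    """Sum only the losses this dataset supervises; the remaining
    heads receive no gradient from this batch."""
    active = DATASET_LOSSES[dataset_name]
    return sum(loss_fns[name](batch) for name in active)

def train(num_steps, loss_fns, seed=0):
    """Alternate batches across the disjoint datasets."""
    rng = random.Random(seed)
    history = []
    for _ in range(num_steps):
        name = rng.choice(sorted(DATASET_LOSSES))  # sample a dataset per step
        batch = object()                           # stand-in for a real batch
        history.append((name, training_step(name, batch, loss_fns)))
    return history
```

Because every dataset still supervises the shared proposal head, the detector sees gradient on every step, while the tracking and captioning heads are updated only when their data is sampled.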
The authors carefully design new evaluation metrics that capture all components of the Dense VOC task, and show how they can repurpose existing video grounding datasets (VidSTG and VLN) for this new task. They demonstrate that their model outperforms a number of strong baselines, and can also be applied to the task of spatial grounding, outperforming prior state-of-the-art on VidSTG and VLN without explicit training for it.
Stats
The VidSTG dataset contains 19,000 captioned trajectories across its training and testing splits.
The VLN dataset contains a total of 5,588 training and 3,071 validation captioned trajectories.