Dense Video Object Captioning: Detecting, Tracking, and Describing Object Trajectories in Videos
This article proposes dense video object captioning, a new task that requires detecting, tracking, and captioning object trajectories in videos. The authors design an end-to-end model that can be trained on a mixture of disjoint datasets, which enables zero-shot transfer and provides a strong initialization for further finetuning.
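To make the task definition concrete, the sketch below shows one plausible shape for the task's output: each detected object yields a trajectory (a bounding box per frame it appears in) together with a single natural-language caption. The class and field names are illustrative assumptions, not the paper's actual API.

```python
from dataclasses import dataclass, field
from typing import List, Tuple


@dataclass
class CaptionedTrajectory:
    """Hypothetical output record: one tracked object plus its description."""
    track_id: int
    # (frame_index, x1, y1, x2, y2) boxes tracing the object through the video
    boxes: List[Tuple[int, float, float, float, float]] = field(default_factory=list)
    caption: str = ""


def describe(traj: CaptionedTrajectory) -> str:
    """Summarize a trajectory: its caption plus its temporal extent."""
    frames = [b[0] for b in traj.boxes]
    return f'track {traj.track_id}: "{traj.caption}" (frames {min(frames)}-{max(frames)})'


traj = CaptionedTrajectory(
    track_id=0,
    boxes=[(0, 10.0, 20.0, 50.0, 80.0), (1, 12.0, 21.0, 52.0, 81.0)],
    caption="a person walking a dog",
)
print(describe(traj))  # → track 0: "a person walking a dog" (frames 0-1)
```

Under this view, "dense" captioning means producing such a record for every salient object in the video, not a single clip-level sentence.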