The paper presents SnAG, a scalable and accurate model for temporal video grounding. The key insights are:
Cross-modal fusion is crucial for video grounding models. The authors analyze the computational cost of early fusion (combining video and text features inside a joint encoder) versus late fusion (encoding each modality independently and combining them afterward), and show that late fusion scales far better for long videos paired with many queries.
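To make the scaling argument concrete, here is a minimal PyTorch sketch of the two cost structures. This is not the authors' code; `video_encoder`, `text_encoder`, and `fuse` are illustrative stand-ins, and the sizes are arbitrary:

```python
import torch
import torch.nn as nn

T, Q, D = 1024, 32, 256  # video clips, number of queries, feature dim

enc_layer = lambda: nn.TransformerEncoderLayer(D, 4, batch_first=True)
video_encoder = nn.TransformerEncoder(enc_layer(), num_layers=2)
text_encoder = nn.TransformerEncoder(enc_layer(), num_layers=2)
fuse = nn.MultiheadAttention(D, 4, batch_first=True)  # lightweight cross-attention

video = torch.randn(1, T, D)     # one long video
queries = torch.randn(Q, 16, D)  # Q short text queries, 16 tokens each

# Early fusion: the joint encoder sees video and text together, so the
# heavy pass over all T clips must be repeated for each of the Q queries.
for q in queries:
    joint = video_encoder(torch.cat([video, q.unsqueeze(0)], dim=1))

# Late fusion: encode the long video once; each query then costs only a
# short text encoding plus one cheap cross-attention over the cached video.
v = video_encoder(video)  # expensive encoder pass, done once
for q in queries:
    t = text_encoder(q.unsqueeze(0))
    fused, _ = fuse(v, t, t)  # video tokens attend to text tokens
```

The key point is that in the late-fusion loop the per-query work is independent of the expensive video encoding, which is amortized over all Q queries.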
The authors propose a video-centric training scheme that reuses video representations across multiple queries, yielding significant efficiency gains over conventional query-centric training, which re-encodes the same video for every one of its queries.
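A hedged sketch of the two batching schemes, assuming hypothetical `encode_video`, `encode_text`, and `grounding_loss` components (not the released training code):

```python
def query_centric_step(model, batch):
    # Conventional scheme: one (video, query) pair per sample, so the same
    # video is re-encoded every time one of its queries is drawn.
    loss = 0.0
    for video, query, target in batch:
        v = model.encode_video(video)  # repeated work per query
        t = model.encode_text(query)
        loss = loss + model.grounding_loss(v, t, target)
    return loss / len(batch)

def video_centric_step(model, batch):
    # Video-centric scheme: sample a video together with many of its
    # queries and reuse the encoded video representation across all of them.
    loss, n = 0.0, 0
    for video, queries, targets in batch:
        v = model.encode_video(video)  # computed once per video
        for query, target in zip(queries, targets):
            t = model.encode_text(query)
            loss = loss + model.grounding_loss(v, t, target)
            n += 1
    return loss / n
```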
SnAG is a simple instantiation of late fusion and video-centric training. It pairs a multi-scale Transformer-based video encoder with a Transformer-based text encoder, fuses the two streams through a lightweight cross-attention module, and decodes candidate moments as time points with boundary offsets using convolutional heads.
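The forward pass might look roughly like the following sketch. Layer counts, dimensions, and decoding details here are assumptions rather than the paper's exact configuration, and all module names are illustrative:

```python
import torch
import torch.nn as nn

class SnAGSketch(nn.Module):
    def __init__(self, dim=256, levels=4):
        super().__init__()
        layer = lambda: nn.TransformerEncoderLayer(dim, 4, batch_first=True)
        self.video_stem = nn.TransformerEncoder(layer(), num_layers=2)
        self.downsample = nn.Conv1d(dim, dim, 3, stride=2, padding=1)  # pyramid step
        self.text_encoder = nn.TransformerEncoder(layer(), num_layers=2)
        self.fusion = nn.MultiheadAttention(dim, 4, batch_first=True)
        self.cls_head = nn.Conv1d(dim, 1, 3, padding=1)  # moment score per point
        self.reg_head = nn.Conv1d(dim, 2, 3, padding=1)  # offsets to start / end
        self.levels = levels

    def forward(self, video, text):
        t = self.text_encoder(text)             # (B, L_text, D)
        feats = self.video_stem(video)          # (B, T, D)
        scores, offsets = [], []
        for _ in range(self.levels):            # multi-scale feature pyramid
            fused, _ = self.fusion(feats, t, t) # late cross-attention fusion
            x = fused.transpose(1, 2)           # (B, D, T) for the conv heads
            scores.append(self.cls_head(x))     # per-point moment score
            offsets.append(self.reg_head(x))    # per-point boundary offsets
            feats = self.downsample(x).transpose(1, 2)
        return scores, offsets
```

At inference, a moment would be recovered by picking high-scoring points and converting each point's two regressed offsets into start and end times.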
Experiments show that SnAG outperforms state-of-the-art methods on long-form video grounding benchmarks like MAD and Ego4D-NLQ, while also achieving competitive results on short-video datasets like Charades-STA and ActivityNet-Captions. SnAG is 43% more accurate and 1.5x faster than the previous best method on the MAD dataset.
Ablation studies confirm the effectiveness of late fusion and video-centric training. Further analysis demonstrates the efficiency gains of SnAG in both training and inference.
Key insights distilled from "SnAG: Scalable and Accurate Video Grounding" by Fangzhou Mu et al., arXiv, 2024-04-04: https://arxiv.org/pdf/2404.02257.pdf