Key concepts
T3AL is a novel test-time adaptation approach that adapts pre-trained Vision and Language Models to localize and recognize actions in untrimmed videos without requiring any training data.
Summary
The paper proposes a novel method, T3AL, to address the problem of Zero-Shot Temporal Action Localization (ZS-TAL) without access to any training data.
The key insights are:
- Existing ZS-TAL methods rely on fine-tuning on large annotated datasets, which can be impractical and lead to poor out-of-distribution generalization.
- T3AL adapts a pre-trained Vision and Language Model (VLM) at test time, without any training data, to localize and recognize actions in untrimmed videos.
- T3AL operates in three steps:
  - Compute a video-level pseudo-label by aggregating information from the entire video.
  - Perform action localization using a novel self-supervised learning procedure.
  - Refine the action region proposals using frame-level textual descriptions from a captioning model.
- Experiments on the THUMOS14 and ActivityNet-v1.3 datasets show that T3AL significantly outperforms zero-shot baselines based on state-of-the-art VLMs, demonstrating the benefits of the test-time adaptation approach.
- Oracle experiments further reveal the potential of the test-time adaptation strategy to surpass current training-based ZS-TAL methods without requiring any labeled data.
Statistics
The video-level representation, computed by averaging the frame-level visual features, is used to identify the video-level pseudo-label. (Eq. 1, Eq. 2)
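The averaging step can be illustrated with a minimal sketch. It assumes that the pseudo-label is chosen as the class whose text embedding is most similar (by cosine similarity) to the mean frame feature; the similarity measure, the function names, and the toy 2-D embeddings are assumptions made for illustration and are not taken from the paper.

```python
import math

def mean_pool(frame_features):
    """Average a list of equal-length frame feature vectors."""
    n = len(frame_features)
    dim = len(frame_features[0])
    return [sum(f[d] for f in frame_features) / n for d in range(dim)]

def cosine(u, v):
    """Cosine similarity between two vectors (0.0 if either is all zeros)."""
    dot = sum(a * b for a, b in zip(u, v))
    norm = math.sqrt(sum(a * a for a in u)) * math.sqrt(sum(b * b for b in v))
    return dot / norm if norm else 0.0

def video_pseudo_label(frame_features, class_text_features):
    """Pick the class whose text feature best matches the mean frame feature.

    Assumption: cosine similarity against class-name embeddings is used here
    only as a plausible stand-in for the paper's Eq. 1 and Eq. 2.
    """
    video_feature = mean_pool(frame_features)
    scores = {name: cosine(video_feature, text)
              for name, text in class_text_features.items()}
    return max(scores, key=scores.get)

# Toy embeddings (hypothetical values, not real VLM outputs).
frames = [[0.9, 0.1], [0.8, 0.3], [0.2, 0.9]]
classes = {"high jump": [1.0, 0.0], "diving": [0.0, 1.0]}
label = video_pseudo_label(frames, classes)
```

In this toy input two of the three frames point toward the first class, so the averaged feature does as well.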
Frame-level scores are computed and then refined by adapting the VLM at test time with a self-supervised learning objective. (Eq. 3-10)
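To make the role of frame-level scores concrete, the sketch below shows one way such scores could be turned into temporal region proposals. The test-time adaptation itself is not shown. The min-max normalization, the fixed threshold, and all function names are assumptions for illustration, not the paper's Eq. 3-10.

```python
def min_max(scores):
    """Rescale scores to [0, 1]; a constant sequence maps to all zeros."""
    lo, hi = min(scores), max(scores)
    if hi == lo:
        return [0.0] * len(scores)
    return [(s - lo) / (hi - lo) for s in scores]

def regions_above(scores, threshold):
    """Group consecutive frames scoring above the threshold into (start, end) pairs.

    Both indices are inclusive frame indices.
    """
    regions = []
    start = None
    for i, s in enumerate(scores):
        if s > threshold and start is None:
            start = i                      # a new region opens here
        elif s <= threshold and start is not None:
            regions.append((start, i - 1))  # the region closed on the previous frame
            start = None
    if start is not None:                  # region still open at the end of the video
        regions.append((start, len(scores) - 1))
    return regions

# Hypothetical per-frame scores against the video-level pseudo-label.
raw_scores = [0.1, 0.2, 0.8, 0.9, 0.3, 0.7, 0.1]
proposals = regions_above(min_max(raw_scores), threshold=0.5)
```

With these toy scores, frames 2 to 3 and frame 5 form the two proposals.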
Frame-level textual descriptions extracted from a captioning model are used to perform text-guided region suppression. (Eq. 11)
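A rough sketch of how caption-based suppression might work is given below. It assumes each region proposal has a caption embedding and that a region is dropped when its mean cosine similarity to the other regions' captions falls below a threshold. This criterion, the threshold value, and the names used are illustrative assumptions and may differ from the paper's Eq. 11.

```python
import math

def cosine(u, v):
    """Cosine similarity between two vectors (0.0 if either is all zeros)."""
    dot = sum(a * b for a, b in zip(u, v))
    norm = math.sqrt(sum(a * a for a in u)) * math.sqrt(sum(b * b for b in v))
    return dot / norm if norm else 0.0

def suppress_regions(regions, caption_features, min_similarity):
    """Keep regions whose caption agrees, on average, with the other regions.

    Assumption: mean pairwise cosine similarity is the agreement measure.
    A lone region has nothing to disagree with, so it is always kept.
    """
    kept = []
    for i, region in enumerate(regions):
        sims = [cosine(caption_features[i], caption_features[j])
                for j in range(len(regions)) if j != i]
        mean_sim = sum(sims) / len(sims) if sims else 1.0
        if mean_sim >= min_similarity:
            kept.append(region)
    return kept

# Hypothetical regions and caption embeddings; the third caption is an outlier.
regions = [(2, 3), (5, 5), (9, 12)]
captions = [[1.0, 0.0], [0.9, 0.1], [0.0, 1.0]]
kept = suppress_regions(regions, captions, min_similarity=0.3)
```

Here the third region is suppressed because its caption embedding is nearly orthogonal to the other two.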
Quotes
"Zero-Shot Temporal Action Localization (ZS-TAL) seeks to identify and locate actions in untrimmed videos unseen during training."
"While model fine-tuning has the clear objective of learning video representations, which allows to effectively localize actions in the untrimmed videos, it also assumes the availability of a large annotated data collection. In certain applications, however, such datasets may be unavailable."
"Motivated by these observations, in this work we propose to investigate the problem of ZS-TAL under a novel perspective, featuring the relevant scenario where training data is inaccessible."