The paper introduces the task of Generalizable Temporal Action Localization (GTAL), which focuses on improving the generalization of action localization methods across different data distributions. The authors analyze how the performance of existing weakly-supervised temporal action localization (WTAL) methods degrades when they are transferred to a different distribution, and find that the degradation stems mainly from localization rather than classification.
To address this, the authors propose STAT (Self-supervised Temporal Adaptive Teacher), a framework based on a teacher-student structure. The key components are:
Refinement Module: This module iteratively refines the teacher model's attention output to better adapt to the target dataset's scale, using a salience sampling strategy and a refinement rule to maintain ranking consistency.
Alignment Module: This module aligns the output of the student and teacher models, including attention, classification scores, and their calibration, to guide the student model's adaptation to the target distribution.
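To make the idea of aligning student and teacher outputs concrete, here is a minimal sketch of an alignment loss. It is not the authors' implementation: the summary does not state the actual loss functions, so the mean squared error on attention, the KL divergence on classification scores, the `weight` parameter, and all function names are assumptions chosen for illustration. The calibration term mentioned above is omitted.

```python
import math


def softmax(logits):
    """Numerically stable softmax over a list of floats."""
    m = max(logits)
    exps = [math.exp(x - m) for x in logits]
    total = sum(exps)
    return [e / total for e in exps]


def attention_alignment(student_att, teacher_att):
    """Mean squared error between per-snippet attention values (assumed loss)."""
    diffs = [(s - t) ** 2 for s, t in zip(student_att, teacher_att)]
    return sum(diffs) / len(diffs)


def classification_alignment(student_logits, teacher_logits, eps=1e-12):
    """Mean KL(teacher || student) over snippets (assumed loss)."""
    total = 0.0
    for s_row, t_row in zip(student_logits, teacher_logits):
        p_s = softmax(s_row)
        p_t = softmax(t_row)
        # eps guards against a student probability that underflows to zero.
        total += sum(t * math.log(t / max(s, eps))
                     for s, t in zip(p_s, p_t) if t > 0)
    return total / len(student_logits)


def alignment_loss(student_att, teacher_att,
                   student_logits, teacher_logits, weight=1.0):
    """Attention term plus a weighted classification term (hypothetical weighting)."""
    return (attention_alignment(student_att, teacher_att)
            + weight * classification_alignment(student_logits, teacher_logits))
```

With this sketch the loss is zero when the student reproduces the teacher's outputs exactly and grows as the two diverge, which is the behavior one would want from a signal that pulls the student toward the refined teacher on the target distribution.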
The authors conduct extensive experiments on three datasets: THUMOS14, ActivityNet1.2, and HACS. The results show that STAT significantly improves the baseline methods under the cross-distribution evaluation setting, even approaching same-distribution performance.
Key insights distilled from
by Yangcen Liu,... at arxiv.org 04-23-2024
https://arxiv.org/pdf/2404.13311.pdf