Key Concept
To enhance the generalization ability of temporal action localization models across different datasets, the authors propose a self-supervised framework called STAT that leverages a teacher-student structure with refinement and alignment modules.
Abstract
The paper introduces the task of Generalizable Temporal Action Localization (GTAL), which focuses on improving the generalization of action localization methods across different data distributions. The authors analyze the performance degradation of existing weakly-supervised temporal action localization (WTAL) methods when transferring to different distributions, and find that the main issue lies in the localization rather than classification.
To address this, the authors propose STAT (Self-supervised Temporal Adaptive Teacher), a framework based on a teacher-student structure. The key components are:
- Refinement Module: This module iteratively refines the teacher model's attention output to better adapt to the target dataset's scale, using a salience sampling strategy and a refinement rule to maintain ranking consistency.
- Alignment Module: This module aligns the output of the student and teacher models, including attention, classification scores, and their calibration, to guide the student model's adaptation to the target distribution.
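The idea of aligning student outputs with teacher outputs can be illustrated with a minimal sketch. This is not the authors' implementation: the function names, the choice of mean squared error for attention, the KL divergence for classification scores, and the weights `w_att` and `w_cls` are all assumptions made for illustration, and the calibration term mentioned above is omitted.

```python
import math


def softmax(logits):
    """Numerically stable softmax over a list of class logits."""
    m = max(logits)
    exps = [math.exp(x - m) for x in logits]
    total = sum(exps)
    return [e / total for e in exps]


def attention_alignment(student_att, teacher_att):
    """Mean squared error between per-snippet attention values (assumed loss)."""
    diffs = [(s - t) ** 2 for s, t in zip(student_att, teacher_att)]
    return sum(diffs) / len(diffs)


def score_alignment(student_logits, teacher_logits, eps=1e-12):
    """KL(teacher || student) over video-level class probabilities (assumed loss)."""
    p = softmax(teacher_logits)
    q = softmax(student_logits)
    return sum(pi * math.log((pi + eps) / (qi + eps)) for pi, qi in zip(p, q))


def alignment_loss(student_att, teacher_att, student_logits, teacher_logits,
                   w_att=1.0, w_cls=1.0):
    """Weighted sum of the two alignment terms; the weights are hypothetical."""
    return (w_att * attention_alignment(student_att, teacher_att)
            + w_cls * score_alignment(student_logits, teacher_logits))


# Example: the student disagrees with the teacher on both outputs.
loss = alignment_loss(
    student_att=[0.5, 0.5],
    teacher_att=[1.0, 0.0],
    student_logits=[0.0, 0.0],
    teacher_logits=[2.0, 0.0],
)
```

In a sketch like this, the loss is zero when the student reproduces the teacher's attention and class scores exactly, and grows as the two diverge, which is what lets the teacher's refined outputs steer the student on unlabeled target data.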
The authors conduct extensive experiments on three datasets: THUMOS14, ActivityNet1.2, and HACS. The results show that STAT significantly improves the baseline methods under the cross-distribution evaluation setting, in some cases approaching same-distribution performance.
Statistics
The median duration of action instances in THUMOS14 is 3.0 seconds, while those in ActivityNet1.2 and HACS are 28.5 seconds and 11.2 seconds, respectively.
Compared to the same-distribution (SmD) setting, the cross-distribution (CrD) setting leads to a significant 12.4% mean Average Precision (mAP) discrepancy for existing methods.
In the CrD setting, the classification accuracy of high-attention snippets decreases by only 6.6%, whereas the accuracy over all snippets decreases by 51.3%.
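The comparison between high-attention snippets and all snippets can be sketched as a simple diagnostic. The function name, the attention threshold of 0.5, and the toy data below are hypothetical and are not taken from the paper; the sketch only shows how such a split accuracy might be computed.

```python
def snippet_accuracy(pred, true, attention=None, threshold=0.5):
    """Classification accuracy over snippets.

    If `attention` is given, only snippets whose attention is at least
    `threshold` (a hypothetical cutoff) are counted.
    Returns None when no snippet is selected.
    """
    if attention is None:
        keep = list(range(len(pred)))
    else:
        keep = [i for i, a in enumerate(attention) if a >= threshold]
    if not keep:
        return None
    correct = sum(1 for i in keep if pred[i] == true[i])
    return correct / len(keep)


# Toy data: five snippets with predicted labels, true labels, and attention.
pred = [0, 1, 1, 2, 0]
true = [0, 1, 2, 2, 1]
attention = [0.9, 0.8, 0.2, 0.7, 0.1]

acc_all = snippet_accuracy(pred, true)               # 3 of 5 correct
acc_high = snippet_accuracy(pred, true, attention)   # 3 of 3 correct
```

A large gap between the two numbers, as in the statistics above, suggests that the classifier remains reliable where attention is high and that errors concentrate in how snippets are selected, that is, in localization.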
Quotes
"Weakly-supervised Temporal Action Localization (WTAL), which focuses on using video-level annotations to identify and classify actions in time, has attracted considerable attention [12,22,33,43,49,55,78]. Despite the advancement, most existing methods operate under the assumption that training and testing data are independent and identically distributed, but this assumption often does not hold in real-world scenarios."
"To address this problem, we introduce a novel setting, termed Generalizable Temporal Action Localization (GTAL). Specifically, GTAL consists of two settings: training and evaluating on the sharing action categories of the same-distribution (SmD), and cross-distribution evaluation (CrD). The SmD setting overlaps with the traditional evaluation protocol, while the CrD setting, which evaluates the generalization ability, is rarely used previously."