One-Stage Open-Vocabulary Temporal Action Detection Leveraging Temporal Multi-scale and Action Label Features
The proposed one-stage method combines a Multi-scale Video Analysis (MVA) module with a Video-Text Alignment (VTA) module to detect a wide range of actions in open-vocabulary settings, and it outperforms existing methods.
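To make the two-module idea concrete, here is a minimal sketch in plain Python. It is not the authors' implementation: the internal designs of MVA and VTA are not described here. The sketch assumes that multi-scale analysis can be approximated by repeatedly average-pooling frame features along time, and that video-text alignment can be approximated by cosine similarity between each temporal feature and an action-label embedding. All names (`temporal_pyramid`, `align_with_labels`, the toy vectors) are hypothetical.

```python
import math


def avg_pool_time(features, stride=2):
    """Average each non-overlapping group of `stride` consecutive feature vectors."""
    pooled = []
    for start in range(0, len(features) - stride + 1, stride):
        window = features[start:start + stride]
        dim = len(window[0])
        pooled.append([sum(vec[d] for vec in window) / stride for d in range(dim)])
    return pooled


def temporal_pyramid(features, num_levels):
    """Assumed stand-in for multi-scale analysis: a pyramid of coarser time scales."""
    levels = [features]
    for _ in range(num_levels - 1):
        if len(levels[-1]) < 2:
            break  # too short to downsample further
        levels.append(avg_pool_time(levels[-1], stride=2))
    return levels


def cosine(a, b):
    """Cosine similarity of two equal-length vectors (0.0 if either is all zeros)."""
    dot = sum(x * y for x, y in zip(a, b))
    norm_a = math.sqrt(sum(x * x for x in a))
    norm_b = math.sqrt(sum(y * y for y in b))
    if norm_a == 0.0 or norm_b == 0.0:
        return 0.0
    return dot / (norm_a * norm_b)


def align_with_labels(levels, label_embeddings):
    """Assumed stand-in for video-text alignment.

    Returns one (level, position, best_label, score) tuple per temporal position.
    """
    results = []
    for level_idx, feats in enumerate(levels):
        for t, vec in enumerate(feats):
            scores = {name: cosine(vec, emb) for name, emb in label_embeddings.items()}
            best = max(scores, key=scores.get)
            results.append((level_idx, t, best, scores[best]))
    return results


# Toy data: four 2-D "frame features" and two label embeddings (made up for illustration).
frames = [[1.0, 0.0], [1.0, 0.0], [0.0, 1.0], [0.0, 1.0]]
labels = {"run": [1.0, 0.0], "jump": [0.0, 1.0]}

levels = temporal_pyramid(frames, num_levels=3)
for level_idx, t, name, score in align_with_labels(levels, labels):
    print(f"level={level_idx} t={t} label={name} score={score:.3f}")
```

Because labels enter only as embedding vectors, a new action name can be scored at inference time by adding its embedding to `labels`, which is the property that makes a detector open-vocabulary. In a real system, the frame and label embeddings would come from pretrained video and text encoders rather than hand-written vectors.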