Key Concepts
This research paper presents a novel approach to temporal action localization in videos, combining multimodal and unimodal transformers to achieve state-of-the-art results in the Perception Test Challenge 2024.
Statistics
The proposed method achieved a final score of 0.5498 in the Perception Test Challenge 2024.
The baseline model achieved an average mAP of 16.0.
UMT achieved an average mAP of 47.3.
VideoMAEv2 achieved an average mAP of 49.1.
The multimodal model achieved an average mAP of 53.2.
Adding audio features increased the average mAP to 49.5.
Combining different video features increased the average mAP to 51.2.
Augmenting the dataset increased the average mAP to 53.2.
Using Weighted Boxes Fusion (WBF) to ensemble predictions increased the average mAP to 54.9.
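The final gain comes from fusing overlapping predictions from multiple models with Weighted Boxes Fusion. The paper's exact implementation is not shown here; the following is a minimal sketch of the idea adapted to 1D temporal segments, with all function names and thresholds being illustrative assumptions: overlapping segments (by temporal IoU) are clustered, and each cluster is fused by a score-weighted average of its boundaries.

```python
# Hypothetical sketch of Weighted Boxes Fusion (WBF) adapted to 1D
# temporal action segments; names and thresholds are illustrative.
# Each prediction is a tuple (start, end, score).

def temporal_iou(a, b):
    """Intersection-over-union of two (start, end, ...) intervals."""
    inter = max(0.0, min(a[1], b[1]) - max(a[0], b[0]))
    union = (a[1] - a[0]) + (b[1] - b[0]) - inter
    return inter / union if union > 0 else 0.0

def wbf_1d(segments, iou_thr=0.5):
    """Fuse overlapping segments by score-weighted boundary averaging.

    segments: list of (start, end, score), possibly from several models.
    Returns fused (start, end, score) tuples.
    """
    segments = sorted(segments, key=lambda s: s[2], reverse=True)
    clusters = []  # member segments per cluster
    fused = []     # current fused segment per cluster

    for seg in segments:
        placed = False
        for i, f in enumerate(fused):
            if temporal_iou(seg, f) > iou_thr:
                clusters[i].append(seg)
                # Re-fuse: boundaries weighted by confidence,
                # fused score is the mean of member scores.
                w = sum(s[2] for s in clusters[i])
                start = sum(s[0] * s[2] for s in clusters[i]) / w
                end = sum(s[1] * s[2] for s in clusters[i]) / w
                fused[i] = (start, end, w / len(clusters[i]))
                placed = True
                break
        if not placed:
            clusters.append([seg])
            fused.append(seg)
    return fused

# Example: two models propose nearly the same action instance,
# plus one unrelated detection that stays separate.
preds = [(1.0, 3.0, 0.9), (1.2, 3.2, 0.7), (10.0, 12.0, 0.6)]
print(wbf_1d(preds))
```

Unlike non-maximum suppression, which discards all but the highest-scoring segment in a cluster, this fusion uses every model's boundaries, which is why ensembling with WBF can raise mAP over any single model.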