Core Concept
This work proposes a multi-view attention consistency method using directed Gromov-Wasserstein discrepancy to ensure that the action recognition model focuses on the proper action subject across different camera views.
Summary
The paper presents a novel approach for multi-view action recognition that aims to ensure the consistency of the model's attention across different camera views. The key contributions are:
Introduction of a multi-view attention consistency method that addresses the problem of ensuring reasonable predictions in action recognition.
Definition of a new metric for multi-view consistent attention using Directed Gromov-Wasserstein Discrepancy, which maintains the motion information and spatial structure of the attention.
Development of an action recognition model based on Video Transformer and Neural Radiance Fields to obtain attention from different views, even for single-view datasets.
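A Gromov-Wasserstein discrepancy compares the *internal* pairwise structure of two attention distributions rather than their raw coordinates, which is what makes it suitable for comparing attention across camera views. Below is a minimal Python sketch of the regular (undirected) Gromov-Wasserstein discrepancy over attended token locations, assuming a fixed coupling for simplicity; the paper's directed variant (which additionally preserves motion information) and the optimization of the transport plan are omitted, and all names here are illustrative:

```python
import math

def pairwise_dists(points):
    """All-pairs Euclidean distances within one view's attended locations."""
    return [[math.dist(p, q) for q in points] for p in points]

def gw_discrepancy(x_pos, y_pos, coupling=None):
    """Gromov-Wasserstein-style discrepancy between two sets of attended
    token locations. It compares intra-view pairwise structure, so it is
    unaffected by view changes (e.g. rotation) that preserve that structure.
    """
    C1, C2 = pairwise_dists(x_pos), pairwise_dists(y_pos)
    n, m = len(C1), len(C2)
    if coupling is None:
        # Uniform coupling: a crude stand-in for an optimized transport plan.
        coupling = [[1.0 / (n * m)] * m for _ in range(n)]
    # sum over i,j,k,l of (C1[i][k] - C2[j][l])^2 * T[i][j] * T[k][l]
    total = 0.0
    for i in range(n):
        for j in range(m):
            for k in range(n):
                for l in range(m):
                    total += (C1[i][k] - C2[j][l]) ** 2 * coupling[i][j] * coupling[k][l]
    return total
```

With an identity coupling, a rotated copy of the same attention pattern yields zero discrepancy (pairwise distances are rotation-invariant), while a rescaled copy does not; this is the sense in which the metric captures structural, rather than coordinate-level, agreement between views.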
The proposed method is evaluated on three large-scale action recognition datasets: Jester, Something-Something V2, and Kinetics-400. An ablation study on Jester demonstrates that the directed Gromov-Wasserstein loss improves the model's performance over both the regular Gromov-Wasserstein loss and no consistency loss, and experiments on the other two datasets show that the proposed approach achieves state-of-the-art results.
Statistics
The paper reports the following key metrics:
Top-1 and Top-5 recognition accuracy on the Jester, Something-Something V2, and Kinetics-400 datasets.
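Top-k accuracy counts a prediction as correct when the true class is among the model's k highest-scoring classes. A minimal sketch of how these reported metrics are computed (the function name is illustrative):

```python
def top_k_accuracy(scores, labels, k):
    """Fraction of samples whose true label is among the k highest-scoring
    classes. `scores` is a list of per-class score lists, one per sample."""
    hits = 0
    for row, label in zip(scores, labels):
        # Indices of the k classes with the highest scores for this sample.
        topk = sorted(range(len(row)), key=lambda c: row[c], reverse=True)[:k]
        hits += label in topk
    return hits / len(labels)
```

Top-1 accuracy is the usual classification accuracy; top-5 is more forgiving and is the standard companion metric on large label spaces such as Kinetics-400's 400 classes.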