The paper analyzes how existing video action detection (VAD) methods form features for classification and finds that they often prioritize the actor regions while overlooking the essential contextual information necessary for accurate classification. To address this issue, the authors propose a new model architecture that assigns a class-dedicated query to each action class, allowing the model to dynamically determine where to focus for effective classification.
The key components of the proposed model are:
3D Deformable Transformer Encoder: This encoder adapts deformable attention to space-time, efficiently processing multi-scale feature maps to capture various levels of semantics and detail (a sketch of the sampling idea follows this list).
Localizing Decoder Layer (LDL): LDL constructs actor-related features for localization and passes these informative, actor-specific features on to the classification module.
Classifying Decoder Layer (CDL): CDL combines class queries with actor-specific context features to generate classification features that are dedicated to each class and each actor simultaneously (a sketch of this class-query mechanism follows the results paragraph below).
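As a rough illustration of the encoder's sampling idea, the sketch below implements one 3D deformable attention head in PyTorch, assuming the standard deformable-attention recipe (as in Deformable DETR) extended to space-time: each query predicts a handful of (t, y, x) sampling offsets and attention weights instead of attending densely. All class names, shapes, and hyperparameters here are illustrative assumptions, not the authors' implementation.

```python
import torch
import torch.nn as nn
import torch.nn.functional as F

class Deformable3DAttentionSketch(nn.Module):
    """One deformable attention head over a (T, H, W) feature volume (illustrative)."""

    def __init__(self, dim: int = 256, num_points: int = 4):
        super().__init__()
        self.num_points = num_points
        # Each query predicts K sampling offsets (dt, dy, dx) and K weights.
        self.offsets = nn.Linear(dim, num_points * 3)
        self.weights = nn.Linear(dim, num_points)
        self.value_proj = nn.Linear(dim, dim)
        self.out_proj = nn.Linear(dim, dim)

    def forward(self, queries, ref_points, feats):
        # queries:    (B, N, dim)       one query per spatio-temporal location
        # ref_points: (B, N, 3)         normalized (t, y, x) in [-1, 1]
        # feats:      (B, dim, T, H, W) one scale of the feature pyramid
        B, N, _ = queries.shape
        K = self.num_points
        # Project values, then restore the 5D layout for grid_sample.
        v = self.value_proj(feats.flatten(2).transpose(1, 2))
        v = v.transpose(1, 2).reshape(feats.shape)
        # Sampling locations = reference point + small predicted offsets.
        offs = self.offsets(queries).view(B, N, K, 3).tanh() * 0.1
        locs = (ref_points.unsqueeze(2) + offs).clamp(-1.0, 1.0)
        # grid_sample expects (x, y, t) ordering in the last dim for 5D input.
        grid = locs.flip(-1).view(B, N, K, 1, 3)
        sampled = F.grid_sample(v, grid, align_corners=True)  # (B, dim, N, K, 1)
        sampled = sampled.squeeze(-1).permute(0, 2, 3, 1)     # (B, N, K, dim)
        # Attention is a softmax over only the K sampled points.
        attn = self.weights(queries).softmax(dim=-1).unsqueeze(-1)
        return self.out_proj((attn * sampled).sum(dim=2))     # (B, N, dim)

# Toy usage: batch 2, 10 queries, 256-dim features over an 8x16x16 volume.
attn = Deformable3DAttentionSketch()
out = attn(torch.randn(2, 10, 256),
           torch.rand(2, 10, 3) * 2 - 1,
           torch.randn(2, 256, 8, 16, 16))
print(out.shape)  # torch.Size([2, 10, 256])
```

Sparse sampling is what makes the encoder efficient: the cost grows with the number of sampled points rather than with the full T×H×W size of the feature volume.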
The authors demonstrate that their model outperforms existing state-of-the-art methods on three challenging VAD benchmarks (AVA, JHMDB51-21, and UCF101-24) while being more computationally efficient. Its key advantages are the ability to attend to class-relevant context beyond the actor regions and to form classification features dedicated to each class and each actor.
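To make the class-dedicated query idea concrete, here is a minimal PyTorch sketch of how a classifying decoder layer might combine per-actor features (such as those produced by the LDL) with one learnable query per action class, so each (actor, class) pair gathers its own context from the video. The names, shapes, and the additive fusion are assumptions for illustration, not the paper's actual implementation.

```python
import torch
import torch.nn as nn

class ClassifyingDecoderSketch(nn.Module):
    def __init__(self, num_classes: int, dim: int = 256, num_heads: int = 8):
        super().__init__()
        # One learnable query per action class (class-dedicated queries).
        self.class_queries = nn.Parameter(torch.randn(num_classes, dim))
        # Cross-attention: combined queries attend over encoder memory.
        self.cross_attn = nn.MultiheadAttention(dim, num_heads, batch_first=True)
        self.norm = nn.LayerNorm(dim)
        # Binary score per (actor, class) pair.
        self.score_head = nn.Linear(dim, 1)

    def forward(self, actor_feats, memory):
        # actor_feats: (B, A, dim)  actor-specific features, e.g. from the LDL
        # memory:      (B, T, dim)  flattened spatio-temporal encoder features
        B, A, D = actor_feats.shape
        C = self.class_queries.shape[0]
        # Fuse each actor feature with each class query -> (B, A*C, dim),
        # yielding queries dedicated to one class AND one actor at once.
        q = actor_feats.unsqueeze(2) + self.class_queries.view(1, 1, C, D)
        q = q.reshape(B, A * C, D)
        # Each (actor, class) query gathers its own context from the video.
        ctx, _ = self.cross_attn(q, memory, memory)
        q = self.norm(q + ctx)
        # One sigmoid score per class per actor.
        logits = self.score_head(q).reshape(B, A, C)
        return logits.sigmoid()

# Toy usage: 2 clips, 3 detected actors, 80 AVA classes, 256-dim features.
decoder = ClassifyingDecoderSketch(num_classes=80)
scores = decoder(torch.randn(2, 3, 256), torch.randn(2, 100, 256))
print(scores.shape)  # torch.Size([2, 3, 80])
```

The point of the fusion is that attention maps are no longer shared across classes: a query for "listening to someone" can attend to a different region than a query for "riding a bike", even for the same actor.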
Key insights distilled from: Jinsung Lee et al., arxiv.org, 09-12-2024, https://arxiv.org/pdf/2407.19698.pdf