The paper introduces a new task of video-based Activated Muscle Group Estimation (AMGE) in the wild, which aims to identify active muscle regions during physical activities in unconstrained environments. To enable research in this area, the authors provide the MuscleMap dataset, which contains over 15,000 video clips of 135 different physical activities with binary annotations for 20 muscle groups.
The authors benchmark several existing approaches, including CNN-based, transformer-based, and graph convolutional network (GCN)-based models, on the MuscleMap dataset. They find that while skeleton-based models perform well on new activity types, video-based models perform better on known activity types. To address this, the authors propose TRANSM3E, a cross-modality knowledge distillation and fusion architecture that combines RGB video and skeleton data.
TRANSM3E introduces three key components: Multi-Classification Tokens (MCT), Multi-Classification Tokens Knowledge Distillation (MCTKD), and Multi-Classification Tokens Fusion (MCTF). MCT expands the prediction space for attributes, MCTKD enables effective cross-modality knowledge transfer, and MCTF integrates the distilled knowledge and classification tokens for the final prediction. The proposed TRANSM3E model outperforms all baselines, including the state-of-the-art MViTv2, on both known and new activity types, demonstrating superior generalizability.
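The core ideas behind MCTKD and MCTF can be illustrated with a minimal sketch. This is not the paper's implementation; the token shapes, the softened-KL distillation loss, and the element-wise averaging fusion are all simplifying assumptions made here for illustration.

```python
import numpy as np

def softmax(x, axis=-1):
    # Numerically stable softmax over the last axis.
    e = np.exp(x - x.max(axis=axis, keepdims=True))
    return e / e.sum(axis=axis, keepdims=True)

def mct_kd_loss(student_tokens, teacher_tokens, temperature=2.0):
    """Toy cross-modality distillation loss: KL divergence between
    temperature-softened token distributions, averaged over the
    classification tokens (one per muscle group). Assumed form, not
    the paper's exact loss."""
    t = temperature
    p = softmax(teacher_tokens / t)
    log_q = np.log(softmax(student_tokens / t) + 1e-12)
    kl = (p * (np.log(p + 1e-12) - log_q)).sum(axis=-1)
    return float((t * t) * kl.mean())

def mct_fusion(rgb_tokens, skel_tokens):
    """Toy fusion of the two modalities' classification tokens;
    a simple element-wise average stands in for the learned MCTF module."""
    return 0.5 * (rgb_tokens + skel_tokens)

# Toy sizes: 20 muscle groups, token embedding dimension 8.
rng = np.random.default_rng(0)
rgb_mct = rng.normal(size=(20, 8))   # tokens from the RGB branch
skel_mct = rng.normal(size=(20, 8))  # tokens from the skeleton branch

loss = mct_kd_loss(rgb_mct, skel_mct)
fused = mct_fusion(rgb_mct, skel_mct)
```

The sketch only conveys the data flow: one classification token per muscle group, a distillation signal aligning tokens across modalities, and a fusion step producing the representation used for the final multi-label prediction.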
Source: by Kunyu Peng, D... at arxiv.org, 04-30-2024
https://arxiv.org/pdf/2303.00952.pdf