Core Concepts
RELI11D is a high-quality multimodal human motion dataset that provides synchronized LiDAR, RGB, IMU, and Event data, enabling comprehensive understanding of complex and rapid human movements. The authors also propose LEIR, a multimodal baseline that fuses the geometric, appearance, and motion-dynamics cues from these modalities, achieving promising results on human pose estimation and global trajectory prediction.
Abstract
The authors present RELI11D, a comprehensive multimodal human motion dataset that captures the movements of 10 actors performing 5 different sports in 7 scenes. The dataset includes synchronized data from LiDAR, RGB cameras, Event cameras, and IMU sensors, providing a rich set of modalities to enable holistic understanding of human motions.
Key highlights:
RELI11D is the first dataset to combine LiDAR, RGB, IMU, and Event modalities for human motion capture.
The dataset includes 3.32 hours of data, with 239k frames of human body point clouds, and annotations for global poses and trajectories.
The authors demonstrate that existing state-of-the-art methods perform poorly on RELI11D due to the complex and rapid motions, highlighting the challenges posed by the dataset.
To address these challenges, the authors propose LEIR, a multimodal baseline that effectively integrates the geometric information from LiDAR, the appearance features from RGB, and the motion dynamics from Event data through a cross-attention fusion strategy. Experiments show that LEIR outperforms existing methods on both human pose estimation and global trajectory prediction tasks, demonstrating the benefits of leveraging multiple modalities.
The authors make both the dataset and the source code publicly available, fostering collaboration and further exploration in this field.
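The summary above says LEIR fuses LiDAR geometry, RGB appearance, and Event motion features through cross-attention, but gives no architectural details. As a rough illustration of the general idea, the sketch below implements single-head cross-attention in numpy, with LiDAR tokens querying RGB and Event tokens. All shapes, feature names, and the random projection matrices are illustrative placeholders standing in for learned weights, not LEIR's actual design.

```python
import numpy as np

def softmax(x, axis=-1):
    # Numerically stable softmax along the given axis
    e = np.exp(x - x.max(axis=axis, keepdims=True))
    return e / e.sum(axis=axis, keepdims=True)

def cross_attention(query, context, d_k=32, seed=0):
    """Single-head cross-attention: `query` tokens attend to `context` tokens.
    Projection matrices are random placeholders for learned parameters."""
    rng = np.random.default_rng(seed)
    Wq = rng.standard_normal((query.shape[-1], d_k))
    Wk = rng.standard_normal((context.shape[-1], d_k))
    Wv = rng.standard_normal((context.shape[-1], d_k))
    Q, K, V = query @ Wq, context @ Wk, context @ Wv
    attn = softmax(Q @ K.T / np.sqrt(d_k))  # (n_query, n_context)
    return attn @ V                         # (n_query, d_k)

# Hypothetical per-frame feature tokens for one timestep
lidar_feat = np.random.default_rng(1).standard_normal((256, 64))   # geometry
rgb_feat   = np.random.default_rng(2).standard_normal((196, 128))  # appearance
event_feat = np.random.default_rng(3).standard_normal((128, 32))   # motion

# LiDAR queries attend to RGB and Event contexts; fuse by concatenation
fused = np.concatenate(
    [cross_attention(lidar_feat, rgb_feat),
     cross_attention(lidar_feat, event_feat)], axis=-1)
print(fused.shape)  # (256, 64)
```

A real model would use multi-head attention with learned projections and likely a different fusion scheme, but the key property shown here carries over: the fused representation keeps one token per LiDAR point-cloud element while pulling in complementary appearance and motion context from the other modalities.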
Stats
The dataset contains 3.32 hours of synchronized LiDAR point clouds, IMU measurements, RGB videos, and Event streams.
It includes 239k frames of human body point clouds captured from 10 actors performing 5 different sports in 7 scenes.
The dataset provides annotations for global human poses and trajectories.