Core Concepts
Temporal Masked Autoencoders (T-MAE) improve representation learning in sparse point clouds by incorporating historical frames and leveraging self-supervised pre-training.
Abstract
The scarcity of annotated data in LiDAR point cloud understanding hinders effective representation learning. T-MAE is a pre-training strategy that incorporates temporal information to enhance comprehension of target objects: a Siamese encoder and a windowed cross-attention module form an architecture tailored to two-frame input. T-MAE outperforms competitive self-supervised approaches on the Waymo and ONCE datasets.
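In schematic form, the pre-training objective described above is masked reconstruction of the current frame conditioned on a historical frame. The formulation below is an illustrative summary rather than the paper's exact loss; in particular, the reconstruction distance $d_{\mathrm{rec}}$ (e.g., a Chamfer-style distance, common in point-cloud MAE variants) is an assumption:

$$
\mathcal{L} \;=\; \sum_{m \in \mathcal{M}} d_{\mathrm{rec}}\big(\hat{P}^{t}_{m},\, P^{t}_{m}\big),
\qquad
\hat{P}^{t} \;=\; D\Big(\mathrm{WinCrossAttn}\big(E(\tilde{P}^{t}),\, E(P^{t-1})\big)\Big),
$$

where $P^{t}$ is the current frame, $P^{t-1}$ a past scan, $\tilde{P}^{t}$ the masked current frame, $\mathcal{M}$ the set of masked regions, $E$ the shared (Siamese) encoder, and $D$ a lightweight decoder.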
- Introduction:
- Self-supervised learning addresses the challenge of insufficient labeled data.
- Pre-training techniques accelerate model convergence and improve performance on downstream tasks.
- Annotations in LiDAR point clouds are costly, making pre-training essential.
- Related Work:
- SSL methods focus on contrastive learning and masked image modeling.
- Prior works mainly concentrate on synthetic objects and indoor scenes.
- T-MAE introduces temporal modeling to leverage historical frames for improved representation learning.
- Method:
- T-MAE utilizes a Siamese encoder and windowed cross-attention module for temporal dependency learning.
- The proposed pre-training strategy reconstructs the current frame using historical information from past scans (see the sketch below).
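The following is a minimal PyTorch sketch of the two-frame pre-training idea, not the authors' implementation: T-MAE builds on a sparse window-based transformer backbone, whereas this toy version treats each frame as a flat set of point-group tokens with dense attention. All module and parameter names (`TMAESketch`, `WindowedCrossAttention`, `group_size`, the squared-error reconstruction loss, etc.) are illustrative assumptions.

```python
# Illustrative sketch only: weight-shared (Siamese) encoder for both frames,
# windowed cross-attention from masked current-frame tokens to previous-frame
# tokens, and a light decoder that reconstructs the masked point groups.
import torch
import torch.nn as nn


class WindowedCrossAttention(nn.Module):
    """Current-frame tokens (queries) attend to previous-frame tokens (keys/values),
    restricted to non-overlapping windows of `window_size` tokens."""

    def __init__(self, dim: int, num_heads: int = 4, window_size: int = 16):
        super().__init__()
        self.window_size = window_size
        self.attn = nn.MultiheadAttention(dim, num_heads, batch_first=True)

    def forward(self, cur_tokens: torch.Tensor, prev_tokens: torch.Tensor) -> torch.Tensor:
        # cur_tokens, prev_tokens: (B, N, C); N divisible by window_size (toy assumption)
        B, N, C = cur_tokens.shape
        w = self.window_size
        q = cur_tokens.reshape(B * N // w, w, C)
        kv = prev_tokens.reshape(B * N // w, w, C)
        out, _ = self.attn(q, kv, kv)        # cross-attention within each window
        return out.reshape(B, N, C)


class TMAESketch(nn.Module):
    """Siamese encoder + windowed cross-attention + decoder for masked reconstruction."""

    def __init__(self, dim: int = 128, group_size: int = 32):
        super().__init__()
        self.group_size = group_size
        self.token_embed = nn.Linear(3 * group_size, dim)   # embed a group of xyz points
        enc_layer = nn.TransformerEncoderLayer(dim, nhead=4, batch_first=True)
        self.encoder = nn.TransformerEncoder(enc_layer, num_layers=2)  # shared by both frames
        self.cross_attn = WindowedCrossAttention(dim)
        self.mask_token = nn.Parameter(torch.zeros(1, 1, dim))
        self.decoder = nn.Linear(dim, 3 * group_size)        # reconstruct xyz of each group

    def tokenize(self, points: torch.Tensor) -> torch.Tensor:
        # points: (B, N, 3) -> (B, N // group_size, C); stand-in for voxel/window grouping
        B, N, _ = points.shape
        groups = points.reshape(B, N // self.group_size, 3 * self.group_size)
        return self.token_embed(groups)

    def forward(self, prev_points, cur_points, mask_ratio: float = 0.75):
        prev_tok = self.encoder(self.tokenize(prev_points))  # historical frame, unmasked
        cur_tok = self.tokenize(cur_points)

        # Randomly mask a large fraction of current-frame tokens (MAE-style).
        B, T, C = cur_tok.shape
        mask = torch.rand(B, T, device=cur_tok.device) < mask_ratio
        cur_in = torch.where(mask.unsqueeze(-1), self.mask_token.expand(B, T, C), cur_tok)

        cur_enc = self.encoder(cur_in)                       # same (Siamese) encoder weights
        fused = self.cross_attn(cur_enc, prev_tok)           # pull in historical context

        pred = self.decoder(fused).reshape(B, -1, self.group_size, 3)
        target = cur_points.reshape(B, -1, self.group_size, 3)
        return ((pred - target) ** 2)[mask].mean()           # loss on masked groups only


if __name__ == "__main__":
    model = TMAESketch()
    prev = torch.randn(2, 512, 3)   # previous LiDAR frame (toy size)
    cur = torch.randn(2, 512, 3)    # current LiDAR frame
    print(model(prev, cur).item())  # self-supervised reconstruction loss
```

In the actual method the grouping, masking, and attention operate on sparse voxelized windows rather than fixed-size point groups, but the data flow (shared encoder, historical context injected via windowed cross-attention, reconstruction of the masked current frame) follows the description above.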
- Experiments:
- Evaluation on the Waymo dataset shows that T-MAE outperforms state-of-the-art methods with limited labeled data.
- Conclusion:
- T-MAE demonstrates the effectiveness of incorporating historical frames in self-supervised pre-training for improved representation learning in sparse point clouds.
Stats
Comprehensive experiments demonstrate that T-MAE achieves higher pedestrian mAPH than MV-JAR when fine-tuned with half the labeled data.