Core Concepts
Temporal Masked Autoencoders (T-MAE) improve representation learning in sparse point clouds by incorporating historical frames and leveraging self-supervised pre-training.
Abstract
The scarcity of annotated data in LiDAR point cloud understanding hinders effective representation learning. T-MAE proposes a pre-training strategy that incorporates temporal information to enhance comprehension of target objects. A Siamese encoder and a windowed cross-attention module form an effective architecture for the two-frame input. T-MAE outperforms competitive self-supervised approaches on the Waymo and ONCE datasets.
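For intuition, the masked-autoencoding part can be sketched as follows. This is a minimal single-frame sketch assuming pre-grouped point tokens, a 70% masking ratio, and a Chamfer-style reconstruction loss; the module names, shapes, and hyperparameters are illustrative assumptions, not the paper's released implementation.

```python
# Minimal sketch of masked-autoencoder pre-training on grouped point tokens.
# Everything below (names, sizes, the Chamfer-style loss) is an illustrative
# assumption, not the authors' released code.
import torch
import torch.nn as nn


def chamfer_l2(pred: torch.Tensor, target: torch.Tensor) -> torch.Tensor:
    """Symmetric Chamfer distance between point sets of shape (B, N, 3) / (B, M, 3)."""
    d = torch.cdist(pred, target)                       # (B, N, M) pairwise distances
    return d.min(dim=2).values.mean() + d.min(dim=1).values.mean()


class PointMAE(nn.Module):
    """Single-frame masked autoencoder over local point groups (hypothetical module)."""

    def __init__(self, dim: int = 256, mask_ratio: float = 0.7, points_per_token: int = 32):
        super().__init__()
        self.mask_ratio = mask_ratio
        self.token_embed = nn.Linear(points_per_token * 3, dim)   # embed each group
        layer = nn.TransformerEncoderLayer(dim, nhead=8, batch_first=True)
        self.encoder = nn.TransformerEncoder(layer, num_layers=4)
        self.decoder = nn.TransformerEncoder(
            nn.TransformerEncoderLayer(dim, nhead=8, batch_first=True), num_layers=2)
        self.mask_token = nn.Parameter(torch.zeros(1, 1, dim))
        self.head = nn.Linear(dim, points_per_token * 3)          # predict xyz of masked groups

    def forward(self, groups: torch.Tensor) -> torch.Tensor:
        # groups: (B, T, K, 3) -- T local groups of K points (e.g. from FPS + kNN).
        # Positional embeddings of the group centres are omitted for brevity.
        B, T, K, _ = groups.shape
        tokens = self.token_embed(groups.flatten(2))              # (B, T, dim)

        # Randomly split tokens into visible and masked sets.
        n_mask = int(T * self.mask_ratio)
        perm = torch.rand(B, T, device=groups.device).argsort(dim=1)
        vis_idx, mask_idx = perm[:, n_mask:], perm[:, :n_mask]
        batch = torch.arange(B, device=groups.device)[:, None]

        # Encode only the visible tokens, then decode with mask tokens appended.
        latent = self.encoder(tokens[batch, vis_idx])
        masked = self.mask_token.expand(B, n_mask, -1)
        decoded = self.decoder(torch.cat([latent, masked], dim=1))[:, -n_mask:]

        # Reconstruct the masked groups and score them with the Chamfer-style loss.
        pred = self.head(decoded).view(B, n_mask, K, 3)
        target = groups[batch, mask_idx]
        return chamfer_l2(pred.flatten(1, 2), target.flatten(1, 2))


if __name__ == "__main__":
    model = PointMAE()
    loss = model(torch.randn(2, 64, 32, 3))   # 2 scenes, 64 groups of 32 points each
    loss.backward()
```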
Introduction:
Self-supervised learning (SSL) addresses the challenge of insufficient labeled data.
Pre-training techniques accelerate model convergence and improve performance on downstream tasks.
Annotations in LiDAR point clouds are costly, making pre-training essential.
Related Work:
SSL methods focus on contrastive learning and masked image modeling.
Prior works mainly concentrate on synthetic objects and indoor scenes.
T-MAE introduces temporal modeling to leverage historical frames for improved representation learning.
Method:
T-MAE uses a Siamese encoder and a windowed cross-attention module to learn temporal dependencies across frames (see the sketch below).
The proposed pre-training strategy reconstructs the current frame using historical information from past scans.
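A rough PyTorch sketch of the two-frame design is given below, assuming a weight-shared (Siamese) per-frame encoder and a cross-attention step restricted to tokens that fall in the same coarse BEV window. The window partitioning, module names, and dimensions are hypothetical placeholders rather than the actual T-MAE code.

```python
# Rough sketch of the two-frame idea: a weight-shared (Siamese) encoder embeds the
# historical and current frames, then current-frame tokens cross-attend to past-frame
# tokens inside the same coarse BEV window. Window size, hashing, and all module
# names are assumptions for illustration, not the released T-MAE implementation.
import torch
import torch.nn as nn


class WindowedCrossAttention(nn.Module):
    """Cross-attention restricted to tokens that share a window id."""

    def __init__(self, dim: int = 256, heads: int = 8):
        super().__init__()
        self.attn = nn.MultiheadAttention(dim, heads, batch_first=True)

    def forward(self, cur_feat, cur_win, past_feat, past_win):
        # cur_feat: (N_cur, dim), past_feat: (N_past, dim); *_win: integer window ids.
        out = cur_feat.clone()
        for w in torch.unique(cur_win):
            q = cur_feat[cur_win == w].unsqueeze(0)        # queries from the current frame
            kv = past_feat[past_win == w].unsqueeze(0)     # keys/values from the past frame
            if kv.shape[1] == 0:
                continue                                   # no history falls in this window
            fused, _ = self.attn(q, kv, kv)
            out[cur_win == w] = fused.squeeze(0)
        return out


class TwoFrameEncoder(nn.Module):
    """Siamese per-frame encoder followed by past-to-current windowed cross-attention."""

    def __init__(self, dim: int = 256, window: float = 4.0):
        super().__init__()
        self.window = window                               # assumed window edge length (m)
        self.embed = nn.Sequential(nn.Linear(3, dim), nn.ReLU(), nn.Linear(dim, dim))
        self.frame_encoder = nn.TransformerEncoder(        # weights shared by both frames
            nn.TransformerEncoderLayer(dim, nhead=8, batch_first=True), num_layers=4)
        self.cross = WindowedCrossAttention(dim)

    def window_id(self, xyz):
        # Hash each token centre to a coarse BEV cell (collisions ignored for brevity).
        ij = torch.div(xyz[:, :2], self.window, rounding_mode="floor").long()
        return ij[:, 0] * 10_000 + ij[:, 1]

    def forward(self, cur_xyz, past_xyz):
        # cur_xyz / past_xyz: (N, 3) token centres of the current and historical scan.
        cur = self.frame_encoder(self.embed(cur_xyz).unsqueeze(0)).squeeze(0)
        past = self.frame_encoder(self.embed(past_xyz).unsqueeze(0)).squeeze(0)
        return self.cross(cur, self.window_id(cur_xyz), past, self.window_id(past_xyz))


if __name__ == "__main__":
    enc = TwoFrameEncoder()
    feats = enc(torch.randn(128, 3) * 20, torch.randn(160, 3) * 20)
    print(feats.shape)   # torch.Size([128, 256])
```

In the full method, such temporally enriched current-frame features would feed the reconstruction decoder during pre-training and a detection head after fine-tuning.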
Experiments:
Evaluation on the Waymo dataset shows that T-MAE outperforms state-of-the-art methods when fine-tuned with limited labeled data.
Conclusion:
T-MAE demonstrates the effectiveness of incorporating historical frames into self-supervised pre-training for improved representation learning in sparse point clouds.
Stats
Comprehensive experiments demonstrate that T-MAE achieves higher pedestrian mAPH than MV-JAR even when fine-tuned with only half as much labeled data.