Core Concepts
Modeling spatial-temporal correlations in point cloud videos as solvable partial differential equations (PDEs) to enhance representation learning and improve performance on downstream tasks.
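One illustrative way to write this idea down (the paper's exact formulation is not reproduced in this summary, so the operators below are assumptions): treat the per-point features $f(x,t)$ as a field governed by a generic evolution equation, with the network learning the solution operator that maps temporal observations back to spatial features,

$$\frac{\partial f(x,t)}{\partial t} = \mathcal{F}\big(f(x,t),\, \nabla_x f(x,t)\big), \qquad \hat{f}_{\mathrm{spatial}} = \mathcal{G}_\theta\big(f_{\mathrm{temporal}}\big),$$

where $\mathcal{G}_\theta$ stands for the learned PDE-solving module described below.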
Abstract
The paper proposes Motion PointNet, a novel approach to point cloud video representation learning that casts the task as a PDE-solving problem. The key ideas are:
- PointNet-like Encoder:
  - Extends the spatial set abstraction of PointNet++ into the temporal domain by operating on adjacent point cloud frames.
  - Preserves the sequence length while aggregating temporal information, increasing the local information density of the features (see the encoder sketch after this list).
- PDE-solving Module:
  - Formulates the reconstruction of spatial features from temporal features as a PDE-solving problem.
  - Employs a combination of multi-head self-attention, spectral methods, and multi-head cross-attention to learn the PDE mapping (see the module sketch after this list).
  - Uses a contrastive learning structure to guide and refine the PDE-solving process, optimizing the feature representation.
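A minimal PyTorch sketch of what such a temporal set-abstraction step could look like. The class name `TemporalSetAbstraction`, the k-nearest-neighbour grouping over a two-frame window, and the plain slice standing in for farthest point sampling are all illustrative assumptions, not the paper's exact design:

```python
import torch
import torch.nn as nn

class TemporalSetAbstraction(nn.Module):
    """Hypothetical sketch: PointNet++-style set abstraction extended to
    pairs of adjacent frames. Anchors come from frame t, neighbours are
    grouped from the union of frames t and t+1, so the output keeps the
    original sequence length T."""

    def __init__(self, in_dim=3, out_dim=64, k=16, n_anchors=256):
        super().__init__()
        self.k, self.n_anchors = k, n_anchors
        self.mlp = nn.Sequential(
            nn.Linear(in_dim, out_dim), nn.ReLU(),
            nn.Linear(out_dim, out_dim),
        )

    def forward(self, pts):                        # pts: [B, T, N, 3]
        B, T, N, C = pts.shape
        frames = []
        for t in range(T):
            # pair frame t with its successor (the last frame pairs with itself)
            pair = torch.cat([pts[:, t], pts[:, min(t + 1, T - 1)]], dim=1)  # [B, 2N, 3]
            # stand-in for farthest point sampling; requires N >= n_anchors
            anchors = pts[:, t, :self.n_anchors]                             # [B, A, 3]
            # k nearest neighbours of each anchor within the two-frame window
            dist = torch.cdist(anchors, pair)                                # [B, A, 2N]
            idx = dist.topk(self.k, largest=False).indices                   # [B, A, k]
            groups = torch.gather(
                pair.unsqueeze(1).expand(-1, self.n_anchors, -1, -1),
                2, idx.unsqueeze(-1).expand(-1, -1, -1, C))                  # [B, A, k, 3]
            # center on the anchor, lift with a shared MLP, max-pool over the group
            feat = self.mlp(groups - anchors.unsqueeze(2)).max(dim=2).values
            frames.append(feat)                                              # [B, A, out_dim]
        return torch.stack(frames, dim=1)                                    # [B, T, A, out_dim]

x = torch.randn(2, 24, 512, 3)                     # 2 videos, 24 frames, 512 points
print(TemporalSetAbstraction()(x).shape)           # torch.Size([2, 24, 256, 64])
```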
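And a hedged sketch of how the described attention-plus-spectral pipeline could be assembled. `PDESolvingModule`, `info_nce`, the FNO-style Fourier filter, and all layer sizes are assumptions for illustration, not the paper's implementation:

```python
import torch
import torch.nn as nn
import torch.nn.functional as F

class PDESolvingModule(nn.Module):
    """Hypothetical sketch: self-attention over temporal tokens, a spectral
    (Fourier-domain) filtering step as the solver core, and cross-attention
    in which spatial tokens query the 'solved' temporal field."""

    def __init__(self, dim=64, heads=4, modes=8):
        super().__init__()
        self.self_attn = nn.MultiheadAttention(dim, heads, batch_first=True)
        self.cross_attn = nn.MultiheadAttention(dim, heads, batch_first=True)
        # learnable complex weights over the lowest `modes` Fourier modes;
        # the token sequence length L must satisfy L // 2 + 1 >= modes
        self.spectral_w = nn.Parameter(torch.randn(modes, dim, dtype=torch.cfloat) * 0.02)
        self.modes = modes

    def forward(self, temporal_tok, spatial_tok):   # both [B, L, dim]
        h, _ = self.self_attn(temporal_tok, temporal_tok, temporal_tok)
        # spectral step: filter the token sequence in the Fourier domain
        spec = torch.fft.rfft(h, dim=1)
        low = spec[:, :self.modes] * self.spectral_w
        spec = torch.cat([low, spec[:, self.modes:]], dim=1)
        h = torch.fft.irfft(spec, n=h.shape[1], dim=1)
        # cross-attention: spatial tokens query the solved temporal field
        out, _ = self.cross_attn(spatial_tok, h, h)
        return out                                   # reconstructed spatial features

def info_nce(pred, target, tau=0.07):
    """Minimal InfoNCE-style loss aligning reconstructed and true spatial
    features; a stand-in for the paper's contrastive guidance."""
    p = F.normalize(pred.flatten(1), dim=-1)
    t = F.normalize(target.flatten(1), dim=-1)
    logits = p @ t.T / tau                           # [B, B] similarity matrix
    labels = torch.arange(p.shape[0], device=p.device)
    return F.cross_entropy(logits, labels)

B, L, D = 2, 24, 64
pde = PDESolvingModule(dim=D)
temporal = torch.randn(B, L, D)                      # e.g. tokens from the encoder
spatial = torch.randn(B, L, D)                       # target spatial tokens
loss = info_nce(pde(temporal, spatial), spatial)
```

The InfoNCE term treats matching (reconstructed, true) pairs within a batch as positives and all other pairings as negatives, which is one common way to realize the "contrastive structure guiding the PDE-solving process" described above.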
The proposed Motion PointNet outperforms current state-of-the-art methods on multiple point cloud video action recognition benchmarks, including MSRAction-3D, NTU RGB+D, and UTD-MHAD, while maintaining a lightweight model architecture.
Statistics
The paper reports the following key metrics:
- MSRAction-3D: 97.52% accuracy with only 0.72M parameters and 0.82G FLOPs.
- NTU RGB+D: 92.9% accuracy (cross-subject) and 98.0% (cross-view) with 1.64M parameters and 15.47G FLOPs.
- UTD-MHAD: 92.79% accuracy.
Quotes
"We propose a brand-new perspective that views the process of point cloud video representation learning as a PDE-solving problem."
"By modeling spatial-temporal correlations, we aim to regularize spatial variations with temporal features, thereby enhancing representation learning in point cloud videos."
"Remarkably, our Motion PointNet achieves an impressive accuracy of 97.52% on the MSRAction-3D dataset, surpassing the current state-of-the-art in all aspects while consuming minimal resources (only 0.72M parameters and 0.82G FLOPs)."