Core Concepts
NeRF-MAE introduces a self-supervised framework for 3D representation learning that pretrains on the radiance and density grids of Neural Radiance Fields (NeRFs), yielding substantial improvements across a range of downstream 3D tasks.
Abstract
The paper proposes NeRF-MAE, a self-supervised framework for 3D representation learning using Neural Radiance Fields (NeRFs). The key components are:
- Explicit extraction of a 4D radiance and density grid from a trained NeRF in the canonical world frame, using camera-trajectory-aware sampling (see the grid-extraction sketch after this list).
- A masked self-supervised pretraining module that trains a 3D SwinTransformer encoder and a voxel decoder with an opacity-aware masked reconstruction objective (see the loss sketch after this list).
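A minimal sketch of the grid-extraction step, assuming the trained NeRF is exposed as a hypothetical `query_nerf(points) -> (rgb, sigma)` callable; the camera-trajectory-aware sampling of view directions described in the paper is omitted here for brevity, and the default resolution is illustrative:

```python
import torch

def extract_grid(query_nerf, resolution=160, bounds=(-1.0, 1.0)):
    """Sample a trained NeRF on a regular voxel grid in the canonical world frame.

    Returns a (4, R, R, R) tensor: 3 RGB radiance channels + 1 density channel.
    """
    lo, hi = bounds
    axis = torch.linspace(lo, hi, resolution)
    # Voxel centers of an R^3 grid in world coordinates.
    xs, ys, zs = torch.meshgrid(axis, axis, axis, indexing="ij")
    points = torch.stack([xs, ys, zs], dim=-1).reshape(-1, 3)

    rgb, sigma = query_nerf(points)                        # (N, 3), (N,)
    grid = torch.cat([rgb, sigma.unsqueeze(-1)], dim=-1)   # (N, 4)
    return grid.reshape(resolution, resolution, resolution, 4).permute(3, 0, 1, 2)
```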
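And a hedged sketch of the opacity-aware masked reconstruction objective. The core idea from the paper is that reconstruction should be weighted by voxel opacity so that empty space does not dominate the loss; the exact weighting scheme below is an assumption, and `delta`, `alpha_thresh`, and `w_empty` are illustrative parameters, not the authors' values:

```python
import torch

def opacity_weighted_masked_loss(pred, target, mask, sigma, delta=0.05,
                                 alpha_thresh=0.01, w_empty=0.1):
    """Reconstruction loss over masked voxels, up-weighting opaque ones.

    pred, target: (B, 4, R, R, R) radiance+density grids.
    mask:         (B, R, R, R), 1 where voxel patches were hidden from the encoder.
    sigma:        (B, R, R, R) densities from the target grid.
    """
    # Convert density to opacity, as in standard volume rendering.
    alpha = 1.0 - torch.exp(-sigma * delta)
    # Down-weight (nearly) empty voxels so free space doesn't dominate.
    weights = torch.where(alpha > alpha_thresh,
                          torch.ones_like(alpha),
                          torch.full_like(alpha, w_empty))
    err = (pred - target).pow(2).sum(dim=1)  # per-voxel squared error over 4 channels
    return (err * weights * mask).sum() / mask.sum().clamp(min=1)
```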
NeRF-MAE is pretrained on a large-scale dataset of over 1.6M images and 3,500+ scenes from 4 different sources (Front3D, HM3D, Hypersim, ScanNet).
The pretraining approach shows significant improvements over self-supervised 3D pretraining and NeRF scene understanding baselines on various downstream tasks:
- 3D Object Detection: NeRF-MAE improves AP50 by 21.5% on Front3D and AP25 by 8% on ScanNet.
- Semantic Voxel Labeling: NeRF-MAE improves mIOU by 9.6% and mAcc by 11.2% on Front3D.
- Voxel Super-Resolution: NeRF-MAE achieves higher PSNR and lower MSE than the baseline (the two metrics are related, as sketched below).
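For context on the last item: PSNR is a logarithmic transform of MSE, so the two reported improvements are two views of the same error reduction. A standard definition (not specific to the paper):

```python
import math

def psnr(mse, max_val=1.0):
    # Peak signal-to-noise ratio: higher PSNR corresponds to lower MSE.
    return 20 * math.log10(max_val) - 10 * math.log10(mse)
```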
The paper also demonstrates that NeRF-MAE learns better representations as more unlabeled scenes and higher-quality NeRFs are added to pretraining, highlighting its scalability and adaptability.
Stats
"We achieved an absolute AP50 improvement of 10% when adding 10x the number of scenes for pretraining (1515 vs 151 scenes)."
"Our results show that we can learn better representations, achieving downstream AP25 improvement of 36% when adding higher quality NeRFs to our pretraining (18 vs 28 2D PSNR)."
Quotes
"NeRF-MAE consistently outperforms all competing methods with a clear margin."
"Specifically, we achieve +7.2% Recall50 improvement and +3% AP50 improvement, hence demonstrating an absolute improvement number of +2.5% Recall50 and +0.5% AP50 on Scannet 3D OBB prediction task."
"We also achieve a +9.6% mIOU improvement and +11.2% mAcc improvement on Front3D voxel labelling, hence demonstrating an absolute improvement number of +6% mIOU and +7% mAcc over the best competing 3D pretraining baseline."