Key concepts
This work presents Depth Anything, a highly practical solution for robust monocular depth estimation. By scaling up the dataset with large-scale unlabeled data and employing effective training strategies, the model exhibits impressive generalization ability across extensive unseen scenes.
Summary
This work presents Depth Anything, a practical solution for robust monocular depth estimation. The key highlights are:
- Data scaling-up with large-scale unlabeled data:
  - Collected 62M diverse and informative unlabeled images from various public datasets.
  - Automatically annotated the unlabeled images with pseudo depth labels produced by a pre-trained depth estimation model (the teacher).
  - Jointly trained the model on both labeled and pseudo-labeled data.
- Effective training strategies:
  - Challenged the student model with a more difficult optimization target when learning from unlabeled images (strong perturbations of the inputs), compelling it to seek extra visual knowledge and acquire robust representations.
  - Encouraged the model to inherit rich semantic priors from a pre-trained encoder via a feature alignment loss, enhancing its scene understanding capability.
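The feature-alignment strategy above can be sketched roughly as follows. This is a minimal, hedged illustration, not the paper's implementation: the actual loss operates on deep feature maps from the student and a frozen pre-trained encoder, and the `tolerance` threshold (the paper's tolerance margin, which skips locations that are already well aligned so depth-irrelevant semantic detail is not over-enforced) and the list-of-vectors layout here are illustrative assumptions.

```python
import math

def cosine(u, v):
    """Cosine similarity between two feature vectors."""
    dot = sum(a * b for a, b in zip(u, v))
    norm_u = math.sqrt(sum(a * a for a in u))
    norm_v = math.sqrt(sum(b * b for b in v))
    return dot / (norm_u * norm_v)

def feature_alignment_loss(student_feats, frozen_feats, tolerance=0.85):
    """Sketch of a feature-alignment objective.

    student_feats, frozen_feats: lists of per-location feature vectors
    (a stand-in for the flattened feature maps of the student model and
    a frozen pre-trained encoder). Locations whose cosine similarity
    already exceeds `tolerance` are ignored, so the student keeps
    freedom where it is sufficiently aligned.
    """
    losses = []
    for s, t in zip(student_feats, frozen_feats):
        cos = cosine(s, t)
        if cos < tolerance:          # only penalize poorly aligned locations
            losses.append(1.0 - cos)  # push cosine similarity toward 1
    return sum(losses) / len(losses) if losses else 0.0
```

For identical features the loss is zero (every location clears the tolerance margin); orthogonal features contribute a loss of 1 per location.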
The resulting Depth Anything model exhibits impressive zero-shot depth estimation performance on six unseen datasets, outperforming the state-of-the-art MiDaS model. Further, when fine-tuned with metric depth information, it sets new state-of-the-art results. The trained encoder also serves as a strong multi-task foundation for downstream applications like semantic segmentation.
Statistics
The model is trained on a total of 1.5M labeled images and 62M unlabeled images.
Quotes
"We highlight the value of data scaling-up of massive, cheap, and diverse unlabeled images for MDE."
"We point out a key practice in jointly training large-scale labeled and unlabeled images. Instead of learning raw unlabeled images directly, we challenge the model with a harder optimization target for extra knowledge."
"We propose to inherit rich semantic priors from pre-trained encoders for better scene understanding, rather than using an auxiliary semantic segmentation task."