Core Concepts
Metric3D v2 is a geometric foundation model that can accurately estimate metric depth and surface normals from a single image, enabling zero-shot generalization to diverse scenes and camera settings.
Abstract
The paper introduces Metric3D v2, a geometric foundation model for zero-shot metric depth and surface normal estimation from a single image. Existing methods either struggle with metric depth recovery due to camera parameter ambiguity or have limited zero-shot performance for surface normal estimation due to the lack of diverse training data.
To address these challenges, the authors propose several key innovations:
Canonical Camera Transformation Module: This module explicitly resolves the metric ambiguity problem by transforming training data to a canonical camera space, enabling the network to learn metric depth estimation without being affected by varying camera parameters.
Random Proposal Normalization Loss: This loss normalizes depth within randomly cropped patches rather than over the whole image, which sharpens local depth contrast and improves overall depth accuracy.
Joint Depth-Normal Optimization: The authors propose a joint optimization module that iteratively refines depth and normal predictions, allowing the normal estimator to learn from the abundant depth annotations even without normal labels.
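The canonical camera transformation can be illustrated with a minimal sketch. It assumes the label-scaling variant, where ground-truth depth is rescaled by the ratio of a canonical focal length to the real one, and the prediction is scaled back at inference. The names Sample, to_canonical, from_canonical and the value of CANONICAL_FOCAL are hypothetical and chosen for illustration; they are not taken from the authors' code.

```python
from dataclasses import dataclass
from typing import List

# Assumed canonical focal length in pixels; the value is illustrative only.
CANONICAL_FOCAL = 1000.0


@dataclass
class Sample:
    depth: List[List[float]]  # metric depth map in meters, row-major
    focal: float              # focal length of the capturing camera, in pixels


def to_canonical(sample: Sample, f_c: float = CANONICAL_FOCAL) -> Sample:
    """Rescale depth labels as if the image came from a camera with focal length f_c.

    An object imaged at depth d with focal length f has the same image size as
    one at depth d * f_c / f seen with focal length f_c (pinhole model).
    """
    scale = f_c / sample.focal
    depth = [[d * scale for d in row] for row in sample.depth]
    return Sample(depth=depth, focal=f_c)


def from_canonical(pred_depth: List[List[float]], focal: float,
                   f_c: float = CANONICAL_FOCAL) -> List[List[float]]:
    """Map a canonical-space prediction back to metric depth for the real camera."""
    scale = focal / f_c
    return [[d * scale for d in row] for row in pred_depth]


if __name__ == "__main__":
    sample = Sample(depth=[[2.0, 4.0]], focal=500.0)
    canonical = to_canonical(sample)
    print(canonical.depth)                               # training target
    print(from_canonical(canonical.depth, sample.focal)) # recovered metric depth
```

Because every training label is expressed for the same assumed camera, the network never has to guess the focal length; the known focal length is applied only when converting the prediction back to metric depth.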
With these innovations, the authors scale training to over 16 million images from diverse datasets, leading to state-of-the-art performance on a wide range of depth and normal benchmarks. The versatile Metric3D v2 model enables accurate metric 3D recovery from single images, benefiting various downstream applications such as monocular SLAM and 3D reconstruction.
Stats
"Sensor size and pixel size do not affect the metric depth estimation."
"The focal length is vital for metric depth estimation."
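Both statements follow from the standard pinhole relation, depth = real object size x focal length / imaged size. The sketch below is a worked example under that model; the function name depth_from_pinhole and all numeric values are illustrative assumptions, not figures from the paper.

```python
def depth_from_pinhole(object_size_m: float, focal_mm: float,
                       imaged_size_mm: float, pixel_size_mm: float) -> float:
    """Metric depth of an object under the pinhole model.

    Focal length and imaged size are both converted to pixels, so the
    pixel size cancels out of the ratio.
    """
    focal_px = focal_mm / pixel_size_mm
    imaged_size_px = imaged_size_mm / pixel_size_mm
    return object_size_m * focal_px / imaged_size_px


if __name__ == "__main__":
    # Same scene and lens, two different pixel sizes: the depth is unchanged.
    print(depth_from_pinhole(1.8, 50.0, 9.0, pixel_size_mm=0.005))
    print(depth_from_pinhole(1.8, 50.0, 9.0, pixel_size_mm=0.010))
    # Same imaged size but half the focal length: the implied depth halves.
    print(depth_from_pinhole(1.8, 25.0, 9.0, pixel_size_mm=0.005))
```

Changing the pixel size rescales the focal length and the imaged size by the same factor, so the depth is unaffected, whereas an unknown focal length leaves the depth undetermined up to scale.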
Quotes
"To recover d_a from a single image, focal length, imaging size of the object, and real-world object size must be available."
"Estimating the focal length from a single image is a challenging and ill-posed problem."