Core Concepts
Given only a single reference view of an unseen object, our method efficiently and reliably recovers the object's relative 3D pose (rotation) in novel views, without requiring the object's 3D model or retraining on the new object.
Abstract
The paper introduces NOPE (Novel Object Pose Estimation), a novel approach for 3D object pose estimation from a single image. Unlike existing methods, NOPE requires neither the 3D model of the target object nor retraining on new objects.
The key idea is to train a deep network to predict the average appearance of the object under different viewpoints, conditioned on the reference view and the desired relative pose. At inference time, NOPE matches the query image against these predicted appearances, used as pose-annotated templates, to estimate the relative pose; it can also identify pose ambiguities caused by symmetries or partial occlusions.
The authors show that NOPE outperforms state-of-the-art methods on both synthetic and real-world datasets in terms of accuracy and robustness to partial occlusions, and that it is significantly faster than previous approaches, taking only around 1 second on a single GPU.
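To make the template-matching idea concrete, the following is a minimal sketch of how a relative pose could be recovered by scoring a grid of candidate rotations. The names `appearance_model`, `encode_image`, the rotation grid, and the softmax temperature are illustrative assumptions, not the paper's actual interfaces or settings.

```python
import torch

def estimate_relative_pose(appearance_model, encode_image, ref_img, query_img, rotations):
    """Score each candidate rotation by comparing the appearance predicted from the
    reference view under that rotation against the query image embedding.
    (Hypothetical interface; not the paper's implementation.)"""
    query_emb = encode_image(query_img)                      # (D,) query descriptor
    scores = []
    for R in rotations:                                      # sampled grid of candidate 3D rotations
        template_emb = appearance_model(ref_img, R)          # predicted template descriptor (D,)
        scores.append(torch.cosine_similarity(template_emb, query_emb, dim=0))
    scores = torch.stack(scores)                             # (N,) one score per candidate pose
    probs = torch.softmax(scores / 0.1, dim=0)               # distribution over candidate poses
    best = int(torch.argmax(probs))
    return rotations[best], probs                            # best pose and full distribution
```

Returning the full distribution rather than only the argmax is what makes the ambiguity analysis discussed below possible.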
The paper makes the following key contributions:
A novel approach for 3D object pose estimation from a single image that requires neither the object's 3D model nor retraining on new objects.
The ability to predict the distribution of possible poses, which allows identifying pose ambiguities.
Extensive experiments demonstrating the superior performance of NOPE compared to existing methods, in terms of accuracy, robustness, and runtime.
Figure Captions
"Given as input a single reference view of a novel object, our method predicts the relative 3D pose (rotation) of a query view and its ambiguities."
"Our method works by estimating a probability distribution over the space of 3D poses, visualized here on a sphere centered on the object."
"Our method is also robust to partial occlusions, as shown on the clock hidden in part by a rectangle in the query image."
Quotes
"The practicality of 3D object pose estimation remains limited for many applications due to the need for prior knowledge of a 3D model and a training period for new objects."
"To achieve this, we train NOPE to predict the appearance of the object under novel views. We use these predictions as 'templates' annotated with the corresponding poses. Matching these templates with new input views lets us estimate the object relative pose with respect to the initial view."
"Our approach to handling the ambiguities in novel view synthesis for template matching is to consider the distribution of all the possible appearances of the object for the target viewpoint."