Core Concepts
The paper proposes a method to learn a category-level 3D object pose estimator without requiring any pose annotations. By leveraging diffusion models to generate multiple views of objects and an image encoder to extract robust features, the model learns 3D pose correspondences from the generated views and outperforms state-of-the-art methods on few-shot category-level pose estimation benchmarks.
Abstract
The paper proposes a method to learn a category-level 3D object pose estimator without requiring any pose annotations. The key ideas are:
- Leveraging diffusion models (e.g., Zero-1-to-3) to generate multiple views of objects with controlled poses from a single unannotated image.
- Introducing an image encoder to filter out implausible details in the generated images and extract robust image features from the generated views.
- Jointly optimizing neural meshes for each object instance to learn the 3D pose correspondence from the generated views, then merging the learned meshes into a single category-level representation (see the sketch after this list).
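To make the pipeline concrete, here is a minimal PyTorch sketch of this training loop for a single instance. The `generate_view` function and the `NeuralMesh` class (including its `render` method) are hypothetical stand-ins: the paper's view synthesis uses Zero-1-to-3 and its neural meshes are rasterized with a differentiable renderer, neither of which fits in a short example. Only the overall structure — synthesize a view at a sampled pose, encode it with a frozen encoder, and pull the rendered mesh features toward the encoded features — reflects the described method; the per-instance meshes would then be merged into a category-level representation, a step omitted here.

```python
import torch
import torch.nn.functional as F
import torchvision.models as models

# Frozen image encoder: a stand-in for the paper's robust feature extractor.
backbone = models.resnet50(weights=models.ResNet50_Weights.DEFAULT)
encoder = torch.nn.Sequential(*list(backbone.children())[:-2]).eval()
for p in encoder.parameters():
    p.requires_grad_(False)

def extract_features(images):
    """Encode images (B, 3, H, W) into L2-normalized feature maps (B, C, h, w)."""
    with torch.no_grad():
        return F.normalize(encoder(images), dim=1)

def generate_view(image, azimuth, elevation):
    """Hypothetical stand-in for a Zero-1-to-3 call that synthesizes the object
    under a new relative pose; here we merely perturb the input so the loop
    runs end to end."""
    return (image + 0.05 * torch.randn_like(image)).clamp(0, 1)

class NeuralMesh(torch.nn.Module):
    """Toy per-instance neural mesh: one learnable feature per vertex, plus a
    pose-conditioned soft projection to a feature map (a stand-in for a
    differentiable mesh renderer)."""
    def __init__(self, num_vertices=500, feat_dim=2048, grid=7):
        super().__init__()
        self.vertex_features = torch.nn.Parameter(0.01 * torch.randn(num_vertices, feat_dim))
        self.grid = grid

    def render(self, azimuth, elevation):
        idx = torch.arange(self.vertex_features.shape[0]).float().unsqueeze(1)
        loc = torch.arange(self.grid ** 2).float().unsqueeze(0)
        # Toy visibility: which vertices contribute to which pixels at this pose.
        weights = F.softmax(torch.cos(0.1 * idx + 0.3 * loc + 0.01 * (azimuth + elevation)), dim=0)
        feat_map = self.vertex_features.t() @ weights           # (C, grid^2)
        return F.normalize(feat_map.view(1, -1, self.grid, self.grid), dim=1)

mesh = NeuralMesh()
optimizer = torch.optim.Adam(mesh.parameters(), lr=1e-3)
source = torch.rand(1, 3, 224, 224)   # one unannotated image of the instance

for step in range(100):
    az = float(torch.rand(()) * 360)              # sampled relative pose
    el = float(torch.rand(()) * 60 - 30)
    target = extract_features(generate_view(source, az, el))   # (1, 2048, 7, 7)
    rendered = mesh.render(az, el)                              # same shape
    loss = 1 - F.cosine_similarity(rendered, target, dim=1).mean()
    optimizer.zero_grad()
    loss.backward()
    optimizer.step()
```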
The experiments show that the proposed method can perform category-level 3D object pose estimation using only a single-shot annotation as the pose definition, and significantly outperforms state-of-the-art few-shot category-level pose estimation methods on the PASCAL3D+ and KITTI datasets, even though no pose annotations are used during training.
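At inference, the single annotated example fixes the canonical pose definition, and the pose of a query image can then be recovered by render-and-compare: search over candidate poses for the one whose rendered mesh features best match the query's encoder features. A minimal sketch, reusing the toy `NeuralMesh` and `extract_features` from the previous block; the pose grid and similarity score here are illustrative assumptions, not the paper's exact procedure.

```python
def estimate_pose(query, mesh):
    """Grid-search render-and-compare: return the candidate (azimuth,
    elevation) whose rendered mesh features best match the query image."""
    target = extract_features(query)
    best_sim, best_pose = -1.0, None
    for az in range(0, 360, 15):
        for el in range(-30, 31, 10):
            sim = F.cosine_similarity(mesh.render(az, el), target, dim=1).mean().item()
            if sim > best_sim:
                best_sim, best_pose = sim, (az, el)
    return best_pose  # reported relative to the single-shot pose definition

azimuth, elevation = estimate_pose(torch.rand(1, 3, 224, 224), mesh)
```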
Statistics
The paper does not provide any specific numerical data or metrics in the main text. The key results are presented in the form of qualitative visualizations and comparisons to state-of-the-art methods on benchmark datasets.
Quotes
The paper does not contain any direct quotes that are particularly striking or that directly support its key arguments.