Key Concepts
The paper proposes an open-vocabulary framework that predicts object poses and sizes from text descriptions, trained on a new large-scale dataset.
Summary
This study introduces the problem of open-vocabulary, category-level object pose and size estimation. It presents a framework that leverages pre-trained models to predict normalized object coordinate space (NOCS) maps, and a large-scale dataset, OO3D-9D, for training. In experiments, the method outperforms the baselines on diverse categories and generalizes to novel categories and unseen objects when estimating pose and size.
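To make the NOCS idea concrete, the sketch below shows how a point in normalized object coordinates relates to camera space once a pose and scale are known, and how a metric size follows from the extent of the NOCS points. It illustrates the general NOCS convention only, not the paper's implementation. The function names `nocs_to_camera` and `metric_size` are hypothetical, and the centered unit cube (coordinates in [-0.5, 0.5]) and the single isotropic scale are assumptions.

```python
def nocs_to_camera(points_nocs, rotation, translation, scale):
    """Map NOCS points into camera space: x = scale * R @ p + t.

    Assumes NOCS coordinates are centered at the origin (a common
    convention, not confirmed by the summary above).
    """
    out = []
    for p in points_nocs:
        # Apply the 3x3 rotation matrix (given as nested lists) to the point.
        rotated = [sum(rotation[i][j] * p[j] for j in range(3)) for i in range(3)]
        # Scale to metric units, then shift to the object's position.
        out.append(tuple(scale * rotated[i] + translation[i] for i in range(3)))
    return out


def metric_size(points_nocs, scale):
    """Per-axis extent of the NOCS points, rescaled to metric units."""
    return tuple(
        scale * (max(p[i] for p in points_nocs) - min(p[i] for p in points_nocs))
        for i in range(3)
    )


# Illustrative values only: a 90-degree rotation about the z axis.
ROT_Z_90 = [[0.0, -1.0, 0.0], [1.0, 0.0, 0.0], [0.0, 0.0, 1.0]]
camera_points = nocs_to_camera([(0.5, 0.0, 0.0)], ROT_Z_90, (1.0, 2.0, 3.0), 2.0)
size = metric_size([(-0.5, -0.25, -0.1), (0.5, 0.25, 0.1)], 2.0)
```

Estimating pose and size from a predicted NOCS map amounts to inverting this relation: finding the rotation, translation, and scale that best align the predicted coordinates with the observed 3D points.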
Statistics
Given human text descriptions of arbitrary novel object categories, the robot agent seeks to predict the position, orientation, and size of the target object in the observed scene image.
The OO3D-9D dataset comprises 5,371 objects spanning 216 categories, with annotations for symmetry axes.
The proposed method fully leverages the visual semantic prior from the pre-trained DinoV2 model and the aligned visual and language knowledge within a text-to-image diffusion model.
Quotes
"Vision-based object pose estimation is a fundamental problem in computer vision and robotics."
"Our main contributions include introducing a new challenging problem, establishing a benchmark dataset, and proposing an open-vocabulary framework."
"The proposed method significantly outperforms baselines across all metrics."