A scalable paradigm for reconstructing hand-held objects from monocular RGB images that jointly infers hand and object geometry, and leverages large language/vision models for automated 3D object retrieval and alignment.
This paper proposes the task of Hand-Object Stable Grasp Reconstruction (HO-SGR): jointly optimizing the reconstructions of hands and objects across all frames within a stable grasp. The authors show that, throughout a stable grasp, the object moves with at most one degree of freedom (1-DoF) relative to the hand, and accordingly propose a method that jointly reconstructs hands and objects while constraining the object's motion to this single degree of freedom.
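The 1-DoF idea can be made concrete with a small parameterization sketch (an illustration of the constraint, not the authors' implementation; all function and variable names here are hypothetical). Under the stable-grasp assumption, the object's pose in the hand frame at every frame reduces to one fixed base transform plus a single per-frame angle about a shared axis:

```python
import numpy as np

def rotation_about_axis(axis, theta):
    """Rodrigues' formula: rotation matrix for angle theta about a unit axis."""
    axis = axis / np.linalg.norm(axis)
    K = np.array([[0.0, -axis[2], axis[1]],
                  [axis[2], 0.0, -axis[0]],
                  [-axis[1], axis[0], 0.0]])  # skew-symmetric cross-product matrix
    return np.eye(3) + np.sin(theta) * K + (1.0 - np.cos(theta)) * (K @ K)

def object_pose_in_hand(T_base, axis, theta):
    """Compose a fixed hand-to-object transform with a 1-DoF rotation.

    T_base (4x4) and axis are shared across all frames of the grasp;
    only the scalar angle theta varies per frame.
    """
    T_rot = np.eye(4)
    T_rot[:3, :3] = rotation_about_axis(axis, theta)
    return T_base @ T_rot

# Example: an object that can only pivot about the hand frame's z-axis.
T_base = np.eye(4)
T_base[:3, 3] = [0.0, 0.05, 0.0]          # fixed object offset in the hand frame
axis = np.array([0.0, 0.0, 1.0])
poses = [object_pose_in_hand(T_base, axis, th) for th in (0.0, 0.1, 0.2)]

# The translation is identical across frames; only the rotation varies.
assert all(np.allclose(p[:3, 3], T_base[:3, 3]) for p in poses)
```

In an optimization setting, the shared `T_base` and `axis` plus one angle per frame would be the free variables, which is far fewer than an unconstrained 6-DoF pose per frame.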