Core Concepts
Our method leverages the semantic priors of large pretrained language models to convert natural language descriptions of physical interactions into mathematical constraints, which can then be optimized to refine 3D pose estimates and accurately capture self-contact and person-to-person contact.
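As a rough illustration of this conversion (not the paper's actual prompt or output format), the LMM's free-text answer about touching body parts might be parsed into explicit pairs of constraints; the prompt wording, reply format, and parse_contact_pairs helper below are all hypothetical:

```python
# Hypothetical sketch: turn an LMM's textual contact report into (part, part) pairs.
# The prompt and reply format here are invented for illustration.
PROMPT = (
    "List every pair of body parts that are touching in this image, "
    "one pair per line, formatted as 'person1.part -- person2.part'."
)

def parse_contact_pairs(lmm_reply: str) -> list[tuple[str, str]]:
    """Parse 'a -- b' lines into (a, b) tuples, one per reported contact."""
    pairs = []
    for line in lmm_reply.strip().splitlines():
        a, b = (s.strip() for s in line.split("--"))
        pairs.append((a, b))
    return pairs

# Example reply in the assumed format:
reply = "person1.left_hand -- person2.right_hand\nperson1.head -- person2.head"
print(parse_contact_pairs(reply))
```

Each parsed pair can then be mapped to joints in the 3D body model and penalized during optimization.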
Abstract
The paper presents ProsePose, a novel zero-shot framework that leverages the implicit knowledge of large multimodal models (LMMs) to improve 3D human pose estimation. The key insight is that because language is often used to describe physical interactions, LMMs can act as priors on pose estimation.
The method works as follows:
It first obtains an initial 3D pose estimate using a regression-based pose estimation model.
It then prompts an LMM (GPT-4V) with the image and a request to identify all pairs of body parts that are touching. The LMM generates a list of contact constraints.
These contact constraints are converted into a loss function that can be optimized jointly with other common losses, such as 2D keypoint loss, to refine the initial pose estimates.
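The conversion from contact pairs to an optimizable loss can be sketched roughly as follows; the joint-name table, array layout, and contact_loss function are illustrative assumptions, not the paper's actual body-model formulation:

```python
import numpy as np

# Hypothetical sketch: penalize the 3D distance between body parts the LMM
# says are touching. Joint names and indices here are illustrative only.
JOINTS = {"left_hand": 0, "right_hand": 1, "left_heel": 2, "right_heel": 3, "head": 4}

def contact_loss(pose_a: np.ndarray, pose_b: np.ndarray,
                 contact_pairs: list[tuple[str, str]]) -> float:
    """Sum of squared 3D distances between LMM-reported touching parts.

    pose_a, pose_b: (num_joints, 3) arrays of joint positions for two people.
    contact_pairs:  (part_on_a, part_on_b) name pairs derived from the LMM.
    """
    loss = 0.0
    for part_a, part_b in contact_pairs:
        diff = pose_a[JOINTS[part_a]] - pose_b[JOINTS[part_b]]
        loss += float(diff @ diff)  # squared distance; zero when the parts touch
    return loss

# Example: the LMM reports the two people's hands touching.
pose_a = np.zeros((5, 3))
pose_b = np.ones((5, 3))
print(contact_loss(pose_a, pose_b, [("left_hand", "right_hand")]))  # 3.0
```

In the full optimization, a term like this would be summed with the 2D keypoint reprojection loss and any pose priors, and minimized over the body-model parameters of both people.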
The authors demonstrate that ProsePose produces more accurate 3D pose reconstructions than previous zero-shot approaches on three two-person interaction datasets and one dataset of complex yoga poses. The method correctly captures self-contact and person-to-person contact without requiring any additional training data or annotations.
The key advantages of ProsePose are:
It leverages the implicit semantic knowledge in LMMs to guide pose optimization, avoiding the need for expensive contact annotations required by previous methods.
It provides a unified framework for resolving both self-contact and person-to-person contact, which are challenging for state-of-the-art pose regression and optimization methods.
It is a zero-shot approach that can be applied to new images without any additional training.
Overall, the paper demonstrates the potential of using language models as priors for 3D human pose estimation, particularly in scenarios involving physical interactions and contact.
Example descriptions
"Their faces are touching as they lean into each other"
"The yogi reaches their hands back to touch their heels."
Quotes
"Our central insight is that since language is often used to describe physical interaction, large pretrained text-based models can act as priors on pose estimation."
"We can thus leverage this insight to improve pose estimation by converting natural language descriptors, generated by a large multimodal model (LMM), into tractable losses to constrain the 3D pose optimization."