Core Concepts
ChatPose is a multimodal Large Language Model (LLM) that can directly generate 3D human poses represented as SMPL parameters from text or image inputs, and reason about human poses using its general world knowledge.
Summary
The paper introduces ChatPose, a multimodal Large Language Model (LLM) that can understand and reason about 3D human poses. The key contributions are:
ChatPose embeds SMPL poses as distinct signal tokens within the LLM, enabling it to directly generate 3D body poses from both textual and visual inputs. This allows the LLM to leverage its powerful capabilities for tasks beyond traditional 3D pose estimation and generation.
The paper introduces two novel tasks that require reasoning about human poses: Speculative Pose Generation (SPG) and Reasoning-based Pose Estimation (RPE). These tasks go beyond classical pose estimation and generation by requiring the model to apply its general world knowledge to infer and reason about human poses.
Experiments show that ChatPose outperforms existing multimodal LLMs and task-specific methods on the newly proposed SPG and RPE tasks, demonstrating its ability to understand and reason about 3D human poses.
The paper first provides an overview of related work in 3D human pose estimation, language and pose, and multimodal LLMs. It then details the architecture and training of ChatPose, which integrates SMPL pose as a distinct modality within the LLM.
The key innovation is that by embedding SMPL poses within the LLM, ChatPose can leverage the LLM's general world knowledge to reason about human poses in complex ways. This enables the two new tasks of Speculative Pose Generation and Reasoning-based Pose Estimation, where the model must infer and generate 3D poses based on high-level textual descriptions or scene context, rather than explicit pose instructions.
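The idea of a pose signal token can be illustrated with a minimal sketch. It assumes the hidden state at the signal token's position is mapped by a single linear layer to 72 SMPL pose values (24 joints, 3 axis-angle values each). The token string `[POSE_SIGNAL]`, the helper names, the hidden size, and the random weights are all hypothetical placeholders; the paper's actual projection layer and rotation representation may differ.

```python
import random

NUM_JOINTS = 24              # SMPL: 23 body joints plus the global orientation
POSE_DIM = NUM_JOINTS * 3    # axis-angle, 3 values per joint -> 72

# Hypothetical placeholder for the pose signal token (not the paper's literal token).
POSE_TOKEN = "[POSE_SIGNAL]"


def make_projection(hidden_dim, out_dim, seed=0):
    """Build random weights standing in for a trained linear projection."""
    rng = random.Random(seed)
    weights = [[rng.uniform(-0.1, 0.1) for _ in range(hidden_dim)]
               for _ in range(out_dim)]
    bias = [0.0] * out_dim
    return weights, bias


def project(hidden, weights, bias):
    """Apply y = W h + b using plain Python lists."""
    return [sum(w * h for w, h in zip(row, hidden)) + b
            for row, b in zip(weights, bias)]


def decode_pose(tokens, hidden_states, weights, bias):
    """Return a 24x3 axis-angle pose if the signal token was generated, else None."""
    if POSE_TOKEN not in tokens:
        return None  # plain text answer, no pose requested
    idx = tokens.index(POSE_TOKEN)
    flat = project(hidden_states[idx], weights, bias)
    return [flat[i:i + 3] for i in range(0, POSE_DIM, 3)]


# Toy usage: fake hidden states for a short generated answer.
tokens = ["The", "SMPL", "pose", "is", POSE_TOKEN, "."]
hidden_dim = 8
rng = random.Random(1)
hidden_states = [[rng.gauss(0, 1) for _ in range(hidden_dim)] for _ in tokens]
weights, bias = make_projection(hidden_dim, POSE_DIM)
pose = decode_pose(tokens, hidden_states, weights, bias)
print(len(pose), len(pose[0]))  # 24 3
```

Because the pose is read from the token's hidden state rather than spelled out as text, the language model can keep answering in natural language and emit a pose only when the conversation calls for one.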
Beyond these reasoning-focused tasks, where ChatPose outperforms existing multimodal LLMs and task-specific methods, it also performs competitively on classical 3D pose estimation and generation. This demonstrates the value of integrating 3D human pose understanding into a general-purpose multimodal LLM.
Statistics
The SMPL pose of this person is .
The SMPL pose of the person is .
The SMPL format of this person's pose is .
The SMPL pose of the person wearing a green shirt is .
Quotes
"The SMPL pose is ."
"Sure, it is ."
"The SMPL pose of the person is ."