The paper presents TeSMo, a novel approach for generating text-controlled, scene-aware human motions. The key insights are:
Leveraging a pre-trained text-to-motion diffusion model as the base and fine-tuning it with an added scene-aware component, so the model gains scene awareness while retaining the text controllability of the pre-trained model.
Decomposing the motion generation task into two stages: navigation and interaction. The navigation model generates a scene-aware root trajectory that reaches a goal location near the target object, which is then lifted to a full-body motion. The interaction model then directly generates the full-body motion conditioned on the start pose, goal pose, and 3D object geometry.
Creating specialized datasets for training the scene-aware components - Loco-3D-FRONT for navigation and an extended SAMP dataset with text annotations for interactions.
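One common way to add a new conditioning signal to a pretrained diffusion model without disturbing its learned behavior is to route the new features through a branch whose output weights start at zero, so fine-tuning begins exactly at the pretrained model's output. This is an illustrative assumption about how the scene-aware component could be attached, not the paper's confirmed implementation; all names below are hypothetical. A minimal, dependency-free sketch:

```python
# Sketch (assumed mechanism, not TeSMo's actual code): a frozen base
# model plus a zero-initialized scene branch. At the start of
# fine-tuning the branch contributes nothing, so text-conditioned
# behavior is preserved; training then learns a scene-dependent delta.

def base_model(motion_feat, text_feat):
    # Stand-in for the pretrained text-to-motion denoiser.
    return [m + t for m, t in zip(motion_feat, text_feat)]

class SceneBranch:
    def __init__(self, dim):
        # Zero-initialized output weights: the branch's contribution
        # is exactly zero before any fine-tuning updates.
        self.w = [0.0] * dim

    def __call__(self, scene_feat):
        return [w * s for w, s in zip(self.w, scene_feat)]

def scene_aware_model(motion_feat, text_feat, scene_feat, branch):
    # Base prediction plus the (initially zero) scene-dependent delta.
    base = base_model(motion_feat, text_feat)
    delta = branch(scene_feat)
    return [b + d for b, d in zip(base, delta)]
```

With the branch untrained, `scene_aware_model` reproduces `base_model` exactly, which is the property that lets fine-tuning add scene awareness without erasing text controllability.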
The experiments demonstrate that TeSMo outperforms prior scene-aware and scene-agnostic methods in terms of goal-reaching, collision avoidance, and the plausibility of generated human-object interactions, while preserving the realism and text-following ability of the base diffusion model.
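The two-stage decomposition described above (navigation toward the object, then object-conditioned interaction) can be sketched as a pipeline. All function names, signatures, and the placeholder motion representation here are illustrative, not the paper's API; the real models are diffusion-based, which this stub replaces with trivial interpolation:

```python
# Hypothetical sketch of the two-stage navigation/interaction pipeline.
# Real TeSMo stages are diffusion models; these stubs only show the
# data flow: text + scene -> trajectory -> full body -> interaction.

def generate_navigation(text, scene_map, goal, steps=10):
    """Stage 1 (stub): a scene-aware root trajectory reaching the goal,
    lifted to a placeholder full-body walking motion."""
    start = (0.0, 0.0)
    trajectory = [
        (start[0] + (goal[0] - start[0]) * t / steps,
         start[1] + (goal[1] - start[1]) * t / steps)
        for t in range(steps + 1)
    ]
    return [{"root": p, "pose": "walk"} for p in trajectory]

def generate_interaction(start_pose, goal_pose, object_geometry, text):
    """Stage 2 (stub): full-body motion conditioned on start pose,
    goal pose, and 3D object geometry."""
    mid_pose = {"root": goal_pose["root"], "pose": text}
    return [start_pose, mid_pose, goal_pose]

def tesmo_pipeline(text, scene_map, goal, object_geometry):
    nav = generate_navigation(text, scene_map, goal)
    start_pose = nav[-1]  # hand off the final navigation pose
    goal_pose = {"root": goal, "pose": "interact"}
    interaction = generate_interaction(start_pose, goal_pose,
                                       object_geometry, text)
    return nav + interaction
```

The key design point the sketch illustrates is the handoff: the interaction stage is conditioned on the navigation stage's final pose, so the two independently trained models compose into one continuous motion.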