The paper presents InterDreamer, a novel framework that generates 3D human-object interaction sequences from text descriptions in a zero-shot setting, i.e., without requiring text-interaction paired data for training. The key insights are:
Interaction semantics and dynamics can be decoupled. The high-level semantics of an interaction, aligned with its textual description, can be informed by human motion and the initial object pose. The low-level dynamics of the interaction, specifically the subsequent behavior of the object, are governed by the forces exerted by the human, within the constraints of physical laws.
The semantics of the interaction can be drawn from sources of prior knowledge that are independent of text-interaction pair datasets, such as a large language model (LLM) and a pre-trained text-to-motion model.
The dynamics of the interaction can be learned from motion capture data without requiring text annotations, by developing a novel world model that predicts the subsequent state of an object affected by the interaction.
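The world-model idea above can be sketched as follows. This is a minimal illustrative stand-in, not the paper's implementation: the function names, state dimensions, and the random linear map standing in for a trained network are all assumptions for the sketch. The key point it demonstrates is the interface: the next object state is predicted from the human motion and the current object state, autoregressively, with no text involved.

```python
import numpy as np

rng = np.random.default_rng(0)

HUMAN_DIM = 66   # assumed SMPL-style human pose dimension (hypothetical)
OBJ_DIM = 9      # assumed object state: 3D translation + 6D rotation (hypothetical)

# Stand-in for learned weights; a real world model would be trained on
# motion capture data (e.g. BEHAVE) without any text annotations.
W = rng.normal(scale=0.01, size=(HUMAN_DIM + OBJ_DIM, OBJ_DIM))

def world_model_step(human_pose, obj_state):
    """Predict the next object state from the human pose and current object state."""
    x = np.concatenate([human_pose, obj_state])
    # Residual prediction: the object only moves as the human acts on it.
    return obj_state + x @ W

def rollout(human_motion, obj_init):
    """Autoregressively roll out object states alongside a human motion sequence."""
    states = [obj_init]
    for pose in human_motion:
        states.append(world_model_step(pose, states[-1]))
    return np.stack(states)

motion = rng.normal(size=(30, HUMAN_DIM))  # a 30-frame human motion sequence
traj = rollout(motion, np.zeros(OBJ_DIM))
print(traj.shape)  # (31, 9): initial object state plus one prediction per frame
```

The residual formulation reflects the physical intuition in the summary: absent human forces, the object state should stay unchanged.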
The framework integrates these components to generate text-aligned 3D human-object interaction sequences in a zero-shot manner. Experiments on the BEHAVE and CHAIRS datasets demonstrate the capability of InterDreamer to generate realistic and coherent interaction sequences that seamlessly align with the text directives.
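How the components fit together can be sketched as a simple orchestration. All function names and return values below are illustrative stubs, not InterDreamer's actual API: the point is only the data flow, in which an LLM supplies high-level semantics, a pre-trained text-to-motion model supplies the human motion, and a world model trained without text supplies the object dynamics.

```python
def llm_plan(text):
    # Stub: an LLM would extract which object is involved and how it is handled.
    return {"object": "chair", "contact": "hands", "action": text}

def text_to_motion(plan, n_frames=30):
    # Stub: a pre-trained text-to-motion model would synthesize human motion
    # aligned with the plan's semantics (66 is a hypothetical pose dimension).
    return [[0.0] * 66 for _ in range(n_frames)]

def world_model(human_motion, obj_init):
    # Stub: predicts each subsequent object state from the human motion,
    # learned from motion capture without text annotations.
    return [obj_init for _ in human_motion]

def interdreamer(text, obj_init):
    """Zero-shot text-to-interaction: no text-interaction pairs used anywhere."""
    plan = llm_plan(text)
    motion = text_to_motion(plan)
    obj_traj = world_model(motion, obj_init)
    return motion, obj_traj

motion, obj_traj = interdreamer("lift the chair", [0.0] * 9)
print(len(motion), len(obj_traj))  # 30 30
```

The design choice worth noting is that no stage consumes text-interaction paired data: text only touches the semantic stages, while the dynamics stage sees only motion.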
Key insights distilled from: Sirui Xu, Ziy... et al., arxiv.org, 03-29-2024, https://arxiv.org/pdf/2403.19652.pdf