Core Concepts
This paper proposes a novel two-stage framework to generate human motions from arbitrary texts, including both action texts and scene texts, by leveraging the strengths of large language models and transformer-based motion generation.
Abstract
The paper introduces a new dataset, HumanML3D++, which expands the existing HumanML3D dataset by adding scene texts to the action texts. This dataset enables the exploration of generating motions from arbitrary texts, going beyond the previous focus on action texts.
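To make the new annotation concrete, a HumanML3D++ sample pairs each motion clip with both an action text and a scene text. The record below is a minimal illustrative sketch using the paper's own example sentences; the field names and file layout are assumptions, not the dataset's published schema.

```python
# Hypothetical HumanML3D++ record. Field names and file layout are
# illustrative assumptions, not the dataset's published schema.
sample = {
    "motion_id": "000042",  # assumed identifier format
    # HumanML3D-style action text: explicitly describes the movement.
    "action_text": "A person takes a few steps forward and then bends down "
                   "to pick up something.",
    # Scene text added in HumanML3D++: describes the situation and leaves
    # the motion implicit, so a model must infer the actions.
    "scene_text": "A person notices his wallet on the ground ahead.",
    # Path to per-frame pose features for the paired motion clip.
    "motion_path": "motions/000042.npy",
}
```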
The proposed framework consists of two main components, sketched in the example below:
- Think Model: This module uses a large language model (LLM) to extract action labels from the given arbitrary texts, handling both action texts and scene texts.
- Act Model: This module employs a transformer-based generative model to generate the final motion sequences from the extracted action labels.
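A minimal sketch of how the two stages could compose, assuming an injected chat-style LLM client and a pretrained text-conditioned motion generator; the prompt, the `query_llm` helper, and the motion model's `generate` method are placeholder interfaces, not the paper's actual implementation:

```python
# Sketch of the two-stage Think/Act pipeline. `query_llm` and the motion
# model's `generate` method are assumed interfaces supplied by the caller;
# they are not the paper's actual code.

THINK_PROMPT = (
    "List the concrete body actions a person would perform in the "
    "following text, as short comma-separated action labels "
    "(e.g., walk forward, bend down, pick up object).\n"
    "Text: {text}\nActions:"
)

def think(text: str, query_llm) -> list[str]:
    """Stage 1 (Think Model): extract action labels from an arbitrary
    text (action text or scene text) using an LLM."""
    response = query_llm(THINK_PROMPT.format(text=text))
    return [label.strip() for label in response.split(",") if label.strip()]

def act(action_labels: list[str], motion_model):
    """Stage 2 (Act Model): condition a transformer-based generator on
    the extracted action labels to synthesize a motion sequence."""
    return motion_model.generate(", ".join(action_labels))

def text_to_motion(text: str, query_llm, motion_model):
    """Full pipeline: arbitrary text -> action labels -> motion."""
    labels = think(text, query_llm)
    # e.g., a scene text like "A person notices his wallet on the ground
    # ahead" might yield ['walk forward', 'bend down', 'pick up wallet'].
    return act(labels, motion_model)
```

The design choice this illustrates is the decoupling: the LLM handles scene understanding (inferring which actions a situation implies), while the motion generator only ever sees explicit action labels, the kind of text it was trained on.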
The authors conduct extensive experiments comparing the framework with existing state-of-the-art methods. The results demonstrate that the two-stage approach generates high-quality, diverse human motions from arbitrary texts, outperforming previous methods that were limited to action texts.
The key highlights of the paper include:
- Introducing the HumanML3D++ dataset with scene text annotations to enable the study of generating motions from arbitrary texts.
- Proposing a novel two-stage framework that leverages the strengths of LLMs and transformer-based motion generation.
- Demonstrating the effectiveness of the proposed approach in generating diverse and realistic human motions from arbitrary texts, including both action texts and scene texts.
- Providing insights into the challenges and opportunities in the practical application of text-to-motion generation.
Statistics
"A person notices his wallet on the ground ahead"
"A person takes a few steps forward and then bends down to pick up something"
Quotes
"Exploring the generation of potential motions from arbitrary texts is important."
"Compared to them, it is more practical to generate motions from arbitrary texts (the right figure in Figure 1), such as 'A person notices his wallet on the ground ahead'."