The paper introduces a new dataset, HumanML3D++, which expands the existing HumanML3D dataset by adding scene texts to the action texts. This dataset enables the exploration of generating motions from arbitrary texts, going beyond the previous focus on action texts.
The proposed framework consists of two main components:
Think Model: This module uses a large language model (LLM) to extract action labels from the given arbitrary texts, handling both action texts and scene texts.
Act Model: This module employs a transformer-based generative model to generate the final motion sequences from the extracted action labels.
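The two-stage pipeline above can be sketched as follows. This is a minimal, illustrative sketch only: the function names, the scene-to-action mapping, and the motion output format are assumptions, not the paper's actual prompts, architectures, or interfaces. In particular, a simple keyword lookup stands in for the Think model's LLM call, and a placeholder array stands in for the Act model's transformer output.

```python
# Hedged sketch of the "think-then-act" two-stage pipeline.
# All names and data here are illustrative stand-ins.

def think(arbitrary_text: str) -> str:
    """Think stage: map an arbitrary text (action text or scene text)
    to an action label. The paper uses an LLM here; a hard-coded
    lookup table stands in for the LLM call in this sketch."""
    scene_to_action = {  # hypothetical mapping an LLM might produce
        "the floor is slippery": "walk carefully",
        "a ball is thrown at you": "catch",
    }
    text = arbitrary_text.lower().strip()
    # Scene texts are translated to actions; action texts pass through.
    return scene_to_action.get(text, text)

def act(action_label: str, num_frames: int = 60, num_joints: int = 22):
    """Act stage: generate a motion sequence from the action label.
    The paper uses a transformer-based generative model; this stub
    returns a zero-filled placeholder of a plausible shape
    (frames x joints x xyz) just to show the interface."""
    return [[[0.0] * 3 for _ in range(num_joints)]
            for _ in range(num_frames)]

def generate_motion(arbitrary_text: str):
    """Full pipeline: arbitrary text -> action label -> motion."""
    label = think(arbitrary_text)
    return label, act(label)

label, motion = generate_motion("The floor is slippery")
print(label)                        # walk carefully
print(len(motion), len(motion[0]))  # 60 22
```

The key design point the sketch illustrates is the decoupling: the motion generator only ever sees clean action labels, so the LLM absorbs all the variability of arbitrary input texts.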
The authors conduct extensive experiments to evaluate the performance of their framework and compare it with existing state-of-the-art methods. The results demonstrate that the proposed two-stage approach can effectively generate high-quality and diverse human motions from arbitrary texts, outperforming previous methods that were limited to action texts.
The key highlights of the paper include:
HumanML3D++: a new dataset that extends HumanML3D with scene texts in addition to action texts.
A two-stage framework that first uses an LLM to infer action labels from arbitrary texts, then generates motion sequences with a transformer-based generative model.
Extensive experiments showing that the framework produces high-quality, diverse motions from arbitrary texts, outperforming prior methods limited to action texts.