Audio-Driven Co-Speech Gesture Video Generation via Motion-Decoupled Diffusion Model


Core Concepts
This work presents a novel motion-decoupled framework that directly generates audio-driven co-speech gesture videos without relying on structural human priors. Its key innovations are a nonlinear thin-plate spline (TPS) transformation that extracts latent motion features, a transformer-based diffusion model that captures the temporal correlation between gestures and speech, and a refinement network that enhances visual details.
Abstract
The paper proposes a unified framework for audio-driven co-speech gesture video generation. The core innovations are:

Motion Decoupling Module with TPS: Introduces a nonlinear TPS transformation to extract latent motion features that preserve essential appearance information, in contrast to previous methods that rely on 2D poses or linear motion representations. The keypoint predictor and optical flow predictor are used to generate TPS transformations and guide the synthesis of gesture video frames.

Latent Motion Diffusion Model: Designs a transformer-based diffusion model to generate motion features conditioned on speech, capturing the inherent temporal dependencies between gestures and speech. Proposes an optimal motion selection module to produce long-term coherent and consistent gesture videos.

Refinement Network: Employs a UNet-like architecture with residual blocks to focus on missing details in certain regions, such as occluded areas and complex textures like hands and faces.

The framework is evaluated on the PATS dataset, outperforming existing methods in both motion-related and video-related metrics. User studies also demonstrate the superiority of the generated gesture videos in terms of realness, diversity, synchrony, and overall quality.
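To make the TPS idea above concrete, here is a minimal sketch of a classical thin-plate spline warp fitted from one set of 2D keypoints to another. It is not the paper's implementation: the function names (`tps_kernel`, `fit_tps`, `apply_tps`), the kernel form U(r) = r^2 log(r^2), and the hand-picked keypoints are all assumptions for illustration. In the framework described above the keypoints would come from a learned keypoint predictor, which is not reproduced here.

```python
import numpy as np

def tps_kernel(r2):
    """Radial basis U(r) = r^2 * log(r^2), with U(0) defined as 0."""
    out = np.zeros_like(r2)
    mask = r2 > 0
    out[mask] = r2[mask] * np.log(r2[mask])
    return out

def fit_tps(src, dst):
    """Solve for TPS parameters mapping src keypoints (n, 2) onto dst (n, 2).

    Returns (weights, affine): weights has shape (n, 2), affine has shape (3, 2).
    Requires at least three non-collinear, distinct source points.
    """
    src = np.asarray(src, dtype=float)
    dst = np.asarray(dst, dtype=float)
    n = src.shape[0]
    # Pairwise squared distances between control points.
    d2 = np.sum((src[:, None, :] - src[None, :, :]) ** 2, axis=-1)
    K = tps_kernel(d2)
    P = np.hstack([np.ones((n, 1)), src])  # affine basis [1, x, y]
    # Standard TPS linear system [[K, P], [P^T, 0]] [w; a] = [dst; 0].
    L = np.zeros((n + 3, n + 3))
    L[:n, :n] = K
    L[:n, n:] = P
    L[n:, :n] = P.T
    rhs = np.zeros((n + 3, 2))
    rhs[:n] = dst
    params = np.linalg.solve(L, rhs)
    return params[:n], params[n:]

def apply_tps(points, src, weights, affine):
    """Warp arbitrary 2D points (m, 2) with a fitted TPS."""
    points = np.asarray(points, dtype=float)
    src = np.asarray(src, dtype=float)
    d2 = np.sum((points[:, None, :] - src[None, :, :]) ** 2, axis=-1)
    U = tps_kernel(d2)
    P = np.hstack([np.ones((points.shape[0], 1)), points])
    return P @ affine + U @ weights

# Hypothetical keypoints: four corners stay fixed, the center point moves.
src = np.array([[0, 0], [1, 0], [0, 1], [1, 1], [0.5, 0.5]], dtype=float)
dst = src.copy()
dst[4] += [0.1, -0.05]
weights, affine = fit_tps(src, dst)
print(apply_tps(src, src, weights, affine))
```

Because the spline interpolates its control points, the warped source keypoints coincide with the target keypoints, while points in between are bent smoothly. That smooth, nonlinear bending is what distinguishes a TPS warp from a purely affine (linear) motion representation.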
Stats
The average distance between closest speech beats and gesture beats is 0.1506. The Fréchet Video Distance (FVD) of the generated videos is 2058.19.
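As a rough illustration of the beat-distance statistic, the sketch below averages, over all speech beats, the gap to the nearest gesture beat. This is one plausible reading of the metric, not the paper's exact definition: the function name `mean_beat_distance`, the use of timestamps, and the direction of matching (speech to gesture) are assumptions, and the source does not state the unit of the reported value.

```python
import numpy as np

def mean_beat_distance(speech_beats, gesture_beats):
    """Mean gap from each speech beat to its closest gesture beat.

    Both inputs are 1D sequences of beat positions in the same unit
    (assumed here to be timestamps; the unit is hypothetical).
    """
    speech = np.asarray(speech_beats, dtype=float)
    gesture = np.asarray(gesture_beats, dtype=float)
    if speech.size == 0 or gesture.size == 0:
        raise ValueError("both beat sequences must be non-empty")
    # gaps[i, j] = |speech beat i - gesture beat j|
    gaps = np.abs(speech[:, None] - gesture[None, :])
    # For every speech beat keep only the nearest gesture beat, then average.
    return float(gaps.min(axis=1).mean())

# Made-up beat positions, purely for illustration.
speech = [0.5, 1.0, 2.0]
gesture = [0.6, 1.3, 1.9]
print(mean_beat_distance(speech, gesture))
```

A smaller value means gesture beats fall closer to speech beats, which is why this kind of statistic is used as a proxy for audio-gesture synchrony.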
Quotes
"Our proposed approach significantly outperforms existing methods on motion-related metrics of FGD (56.44%) and Diversity (8.54%), which reveals that our motion-decoupled and diffusion-based generation framework is capable of generating realistic and diverse gestures in the motion space."

"When compared with the ground truth containing rich details, although generated motion is realistic, they are inevitably influenced by appearance factors. This demonstrates that human perception of motion and appearance are interrelated."

Deeper Inquiries

How can the proposed framework be extended to generate co-speech gestures for multiple speakers in a conversational setting?

Extending the proposed framework to multiple speakers in a conversational setting would require changes to several components. First, the keypoint predictor and TPS transformation module could incorporate speaker identification features, allowing the system to differentiate between speakers and generate gestures for each one. Second, the latent motion diffusion model could be modified to accept inputs from multiple speakers and learn the temporal dependencies between each speaker's gestures and speech. Finally, the optimal motion selection module could be adjusted to select and combine motion features from different speakers, producing coherent and synchronized gesture videos for the whole conversation.

What are the potential applications of the generated co-speech gesture videos beyond human-machine interaction, such as in the entertainment or education domains?

The generated co-speech gesture videos have various potential applications beyond human-machine interaction. In the entertainment industry, these videos can be used to enhance storytelling in animated films or virtual reality experiences by adding realistic and expressive gestures to characters. In education, the videos can be utilized to create engaging and interactive learning materials, such as sign language tutorials or language learning tools. Moreover, in the field of virtual communication and teleconferencing, the generated gesture videos can improve the non-verbal communication aspect, making virtual interactions more engaging and natural.

Can the motion-decoupled approach be applied to other types of human motion generation tasks, such as dance or sign language, to improve the quality and diversity of the generated outputs?

The motion-decoupled approach proposed in the framework can indeed be applied to other types of human motion generation tasks, such as dance or sign language, to enhance the quality and diversity of the generated outputs. By utilizing TPS transformations to extract latent motion features and a transformer-based diffusion model to learn temporal correlations, the system can generate realistic and diverse movements for various applications. For dance generation, the framework can capture intricate choreography details and produce expressive dance sequences. Similarly, for sign language generation, the approach can generate accurate and natural sign gestures, improving the overall quality of sign language communication tools and resources.