Generating High-Quality and Diverse Two-Person Interaction Motions with Text Guidance
Core Concept
Our approach, InterGen, enables the generation of high-quality and diverse two-person interaction motions from text prompts by introducing a novel multimodal dataset, cooperative denoising networks, and effective spatial relation modeling.
Summary
The paper presents InterGen, a diffusion-based approach for generating high-quality and diverse two-person interaction motions from text prompts. The key contributions are:
- InterHuman dataset: The authors contribute a new multimodal dataset, InterHuman, which contains about 107 million frames of two-person interaction motions with 23,337 natural language descriptions. This is the largest and most diverse dataset for human-to-human interaction motions.
- Cooperative denoising networks: The authors introduce a novel denoising architecture with two cooperative transformer-style networks that share weights and use a mutual attention mechanism. This design encourages the two networks to perform the same operations and yield the same motion capacity, effectively avoiding mode collapse during interaction motion generation.
- Spatial relation modeling: The authors propose a non-canonical motion representation that explicitly encodes the global spatial relations between the two interacting people. They also introduce two additional regularization losses, a joint distance map loss and a relative orientation loss, to further model the complex spatial relations during human-to-human interactions.
- Damping scheme: The authors adapt a damping scheme for the regularization losses during training, especially when the sampled timestamp of the diffusion process reaches specific thresholds, to achieve more diverse generation.
Extensive experiments on the InterHuman dataset demonstrate that the proposed InterGen approach can generate more compelling two-person interaction motions than previous methods, and showcase various downstream applications such as trajectory control, interactive motion inbetweening, and person-to-person generation.
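The spatial regularizers and the damping idea listed above can be illustrated with a minimal NumPy sketch. Everything in it is an assumption made for illustration and is not taken from the paper: the function names, the use of a squared error, the 2D facing-direction vector per person, and the rule that the regularizers are switched off above a hypothetical threshold `t_max`. The paper may define these terms differently.

```python
import numpy as np

def joint_distance_map(joints_a, joints_b):
    """Distances between every joint of person A and every joint of person B.

    joints_a, joints_b: arrays of shape (J, 3), both in one shared world frame.
    Returns an array of shape (J, J).
    """
    diff = joints_a[:, None, :] - joints_b[None, :, :]
    return np.linalg.norm(diff, axis=-1)

def distance_map_loss(pred_a, pred_b, true_a, true_b):
    """Mean squared error between predicted and reference distance maps."""
    pred = joint_distance_map(pred_a, pred_b)
    true = joint_distance_map(true_a, true_b)
    return float(np.mean((pred - true) ** 2))

def relative_orientation(facing_a, facing_b):
    """Signed angle (radians) from A's facing direction to B's, in the ground plane."""
    ang_a = np.arctan2(facing_a[1], facing_a[0])
    ang_b = np.arctan2(facing_b[1], facing_b[0])
    delta = ang_b - ang_a
    return float(np.arctan2(np.sin(delta), np.cos(delta)))  # wrap to [-pi, pi]

def relative_orientation_loss(pred_fa, pred_fb, true_fa, true_fb):
    """Squared, wrapped difference between predicted and reference relative angles."""
    err = relative_orientation(pred_fa, pred_fb) - relative_orientation(true_fa, true_fb)
    err = np.arctan2(np.sin(err), np.cos(err))
    return float(err ** 2)

def damped_weight(t, t_max, base_weight=1.0):
    """Hypothetical damping rule: drop the regularizers when timestep t exceeds t_max."""
    return base_weight if t <= t_max else 0.0
```

In a training loop, a term such as `damped_weight(t, t_max) * distance_map_loss(...)` would then be added to the usual diffusion objective, so the spatial constraints only act at the timesteps where they are enabled.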
Statistics
"We have recently seen tremendous progress in diffusion advances for generating realistic human motions."
"Yet, they largely disregard the multi-human interactions."
"Our dataset was amassed through two distinct sessions: daily motion and professional motion."
"The former encompasses a spectrum of high-frequency interactions encountered in everyday life, while the latter is tailored to capture professional interaction performances, consisting of 10 specific categories of expert skills."
"Our InterHuman dataset is the largest and most diverse known scripted dataset of human-to-human interactions, consisting of 7779 motions derived from various categories of human actions, labeled with 23,337 unique descriptions composed of 5656 distinct words, with a total duration of 6.56 hours."
Quotes
"InterGen: Diffusion-based Multi-human Motion Generation under Complex Interactions"
"We have recently seen tremendous progress in diffusion advances for generating realistic human motions. Yet, they largely disregard the multi-human interactions."
"Our dataset was amassed through two distinct sessions: daily motion and professional motion."
In-Depth Questions
How can the proposed InterGen approach be extended to handle more than two interacting people?
The InterGen approach can be extended to handle more than two interacting people by adapting the cooperative denoising architecture to accommodate multiple individuals. Instead of just two denoisers sharing weights, the model can be scaled to include additional denoisers for each additional person involved in the interaction. These denoisers can interact cooperatively, sharing weights and incorporating mutual attention mechanisms to ensure consistency and coherence in the generated motions. By expanding the network architecture to include multiple denoisers, the model can effectively capture the complex interactions between multiple individuals and generate realistic multi-agent interaction motions.
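One way to picture this extension is a single block of shared weights applied to every person, where each person attends to their own tokens and then to the tokens of everyone else. The NumPy sketch below is a simplified illustration under several assumptions: single-head attention, no layer normalization or feed-forward layers, and hypothetical names (`cooperative_step`, `params`). It is not the architecture from the paper.

```python
import numpy as np

def softmax(x, axis=-1):
    x = x - x.max(axis=axis, keepdims=True)
    e = np.exp(x)
    return e / e.sum(axis=axis, keepdims=True)

def attention(query, context, w_q, w_k, w_v):
    """Single-head scaled dot-product attention from query tokens to context tokens."""
    q, k, v = query @ w_q, context @ w_k, context @ w_v
    scores = q @ k.T / np.sqrt(q.shape[-1])
    return softmax(scores) @ v

def cooperative_step(features, params):
    """Apply one shared-weight block to every person (needs at least two people).

    features: list of (T, D) arrays, one per person.
    params: dict with "self" and "mutual" entries, each a list of three (D, D)
            matrices (query, key, value) shared by all people.
    """
    updated = []
    for i, own in enumerate(features):
        # Tokens of all other people form the context for mutual attention.
        others = np.concatenate([f for j, f in enumerate(features) if j != i], axis=0)
        h = own + attention(own, own, *params["self"])    # self-attention
        h = h + attention(h, others, *params["mutual"])   # mutual attention
        updated.append(h)
    return updated

rng = np.random.default_rng(0)
D, T = 8, 5
params = {name: [rng.normal(size=(D, D)) * 0.1 for _ in range(3)]
          for name in ("self", "mutual")}
people = [rng.normal(size=(T, D)) for _ in range(3)]
out = cooperative_step(people, params)
```

Because the same weights process every person, reordering the people only reorders the outputs. That symmetry is the property the shared-weight design is meant to preserve as more people are added.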
What are the potential applications of the InterHuman dataset beyond motion generation, such as in the field of human behavior analysis?
The InterHuman dataset has a wide range of potential applications beyond motion generation, particularly in the field of human behavior analysis. Some of the key applications include:
- Behavioral Studies: Researchers can use the dataset to analyze and understand human interactions in various scenarios, providing insights into social dynamics, communication patterns, and emotional expressions.
- Virtual Reality and Simulation: The dataset can be utilized to create realistic human interactions in virtual environments, enhancing the immersion and authenticity of virtual reality experiences.
- Healthcare and Therapy: The dataset can be used to develop applications for behavioral therapy, social skills training, and communication enhancement for individuals with social interaction difficulties.
- Human-Robot Interaction: The dataset can inform the design of robots and AI systems that interact with humans, improving their ability to understand and respond to human behaviors effectively.
- Security and Surveillance: The dataset can be used for analyzing and detecting suspicious or abnormal behaviors in security and surveillance systems, enhancing public safety and security measures.
How can the spatial relation modeling techniques developed in this work be applied to other domains involving multi-agent interactions, such as robotics or autonomous driving?
The spatial relation modeling techniques developed in this work can be applied to other domains involving multi-agent interactions, such as robotics or autonomous driving, in the following ways:
- Robotics: In robotics, the spatial relation modeling techniques can be used to enhance collaborative tasks among multiple robots. By modeling the spatial relationships between robots, they can coordinate their actions more effectively, leading to improved task performance and efficiency.
- Autonomous Driving: In autonomous driving scenarios, the spatial relation modeling techniques can help autonomous vehicles understand the positions and movements of other vehicles on the road. This information can be crucial for safe and efficient navigation, especially in complex traffic situations.
- Crowd Management: The techniques can also be applied to crowd management scenarios, where multiple agents (such as pedestrians or vehicles) interact in dynamic environments. By modeling spatial relations, it becomes easier to predict and control the movements of agents, leading to better crowd flow and safety.
By incorporating spatial relation modeling techniques into these domains, it is possible to optimize interactions between multiple agents, improve coordination, and enhance overall system performance.
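For robots, vehicles, or pedestrians, the same idea reduces to describing every agent in one shared frame and deriving pairwise quantities from it. The sketch below assumes 2D positions and a yaw angle per agent; the function name `pairwise_relations` and the choice of features are hypothetical and only meant to show the kind of input such a model could consume.

```python
import numpy as np

def pairwise_relations(positions, headings):
    """Pairwise distances and relative headings for N agents.

    positions: (N, 2) coordinates in a shared world frame.
    headings:  (N,) yaw angles in radians.
    Returns (distances, relative_headings), each of shape (N, N), where
    relative_headings[i, j] is the heading of agent j relative to agent i.
    """
    positions = np.asarray(positions, dtype=float)
    headings = np.asarray(headings, dtype=float)
    diff = positions[:, None, :] - positions[None, :, :]
    distances = np.linalg.norm(diff, axis=-1)
    delta = headings[None, :] - headings[:, None]
    relative_headings = np.arctan2(np.sin(delta), np.cos(delta))  # wrap to [-pi, pi]
    return distances, relative_headings

dist, rel = pairwise_relations([[0.0, 0.0], [3.0, 4.0]], [0.0, np.pi / 2])
print(dist[0, 1])  # 5.0
```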