The paper introduces a novel task of generating 3D holistic human motions for both speakers and listeners in human communication. The key highlights are:
HoCo dataset: The authors introduce a new dataset, HoCo, which provides high-quality videos of two-person communication, along with multi-modal annotations including audio, text, and 3D pseudo-ground-truth labels for facial expressions, body poses, and hand gestures of both speakers and listeners (a hypothetical sample layout is sketched after this list).
Audio feature decoupling: The authors propose a method to decouple the audio features into content, style, and semantic components, which facilitates fine-grained control over the generation of expressions and body motions (illustrated in the second sketch below).
Chain-like transformer model: The authors devise a transformer-based auto-regressive model with a chain-like structure to capture the real-time mutual influence between the speaker and the listener, enabling the simultaneous generation of coordinated and diverse motions for both (see the chain-structure sketch below).
Experiments: The authors demonstrate state-of-the-art performance on the HoCo dataset and two other benchmarks, outperforming previous methods in terms of motion diversity, synchronization, and appropriateness of the generated listener reactions.
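To make the dataset description concrete, the sketch below shows what a single HoCo sample could look like given the annotation types listed above. The class name, field names, and array shapes are assumptions for illustration, not the dataset's published schema.

```python
from dataclasses import dataclass
import numpy as np

# Hypothetical layout of one HoCo sample, based on the annotation types
# described above. All field names and shapes are assumptions.
@dataclass
class HoCoSample:
    audio: np.ndarray          # raw waveform, shape (num_audio_samples,)
    transcript: str            # text annotation of the speech
    # 3D pseudo-ground-truth labels, one set per person, T frames each
    speaker_face: np.ndarray   # facial expression params, (T, face_dim)
    speaker_body: np.ndarray   # body pose params, (T, body_dim)
    speaker_hands: np.ndarray  # hand gesture params, (T, hand_dim)
    listener_face: np.ndarray  # same layout as the speaker's labels
    listener_body: np.ndarray
    listener_hands: np.ndarray
```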
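To make the audio-decoupling idea concrete, here is a minimal sketch in which three separate encoders project the same audio features into content, style, and semantic components that can then condition motion generation independently. The encoder choices (a GRU for frame-level content, MLPs for style and semantics) and all dimensions are assumptions; the paper's actual encoders may differ.

```python
import torch
import torch.nn as nn

class AudioDecoupler(nn.Module):
    def __init__(self, audio_dim=128, feat_dim=64):
        super().__init__()
        # Frame-level content (what is said): a recurrent encoder over time.
        self.content_enc = nn.GRU(audio_dim, feat_dim, batch_first=True)
        # Utterance-level style (how it is said): time-pooled, then an MLP.
        self.style_enc = nn.Sequential(
            nn.Linear(audio_dim, feat_dim), nn.ReLU(), nn.Linear(feat_dim, feat_dim))
        # Frame-level semantics (meaning-bearing cues): a per-frame MLP.
        self.semantic_enc = nn.Sequential(
            nn.Linear(audio_dim, feat_dim), nn.ReLU(), nn.Linear(feat_dim, feat_dim))

    def forward(self, audio_feats):
        """audio_feats: (B, T, audio_dim) -> content (B, T, f), style (B, f), semantic (B, T, f)."""
        content, _ = self.content_enc(audio_feats)
        style = self.style_enc(audio_feats.mean(dim=1))  # average-pool over time
        semantic = self.semantic_enc(audio_feats)
        return content, style, semantic
```

Separating the three streams is what allows, for example, swapping the style component while keeping the content fixed to control the character of the generated motion.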
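Finally, a minimal sketch of the chain-like auto-regressive structure: at each timestep the speaker's frame is predicted first, then the listener's frame is predicted with the freshly generated speaker frame folded into its input, so the two streams influence each other in real time. The module layout, fusion scheme, and dimensions are illustrative assumptions rather than the paper's exact architecture.

```python
import torch
import torch.nn as nn

class ChainedMotionGenerator(nn.Module):
    def __init__(self, motion_dim=64, audio_dim=64, d_model=128, nhead=4, layers=2):
        super().__init__()
        # Fuse (speaker frame, listener frame, audio frame) into one token.
        self.fuse = nn.Linear(2 * motion_dim + audio_dim, d_model)
        self.speaker_net = nn.TransformerEncoder(
            nn.TransformerEncoderLayer(d_model, nhead, batch_first=True), layers)
        self.listener_net = nn.TransformerEncoder(
            nn.TransformerEncoderLayer(d_model, nhead, batch_first=True), layers)
        self.speaker_head = nn.Linear(d_model, motion_dim)
        self.listener_head = nn.Linear(d_model, motion_dim)

    def forward(self, audio_feats):  # audio_feats: (B, T, audio_dim)
        B, T, _ = audio_feats.shape
        m = self.speaker_head.out_features
        spk_prev = audio_feats.new_zeros(B, m)  # zero seed frames
        lst_prev = audio_feats.new_zeros(B, m)
        tokens, spk_seq, lst_seq = [], [], []
        for t in range(T):
            # Speaker step: fuse both parties' previous frames with the current
            # audio, attend over the history, decode the speaker's next frame.
            tokens.append(self.fuse(
                torch.cat([spk_prev, lst_prev, audio_feats[:, t]], dim=-1)))
            hist = torch.stack(tokens, dim=1)                  # (B, t+1, d_model)
            spk_t = self.speaker_head(self.speaker_net(hist)[:, -1])
            # Listener step: re-fuse with the speaker frame generated *this*
            # step, so the listener reacts to the speaker within the timestep.
            tokens[-1] = self.fuse(
                torch.cat([spk_t, lst_prev, audio_feats[:, t]], dim=-1))
            hist = torch.stack(tokens, dim=1)
            lst_t = self.listener_head(self.listener_net(hist)[:, -1])
            spk_seq.append(spk_t)
            lst_seq.append(lst_t)
            spk_prev, lst_prev = spk_t, lst_t
        # Aligned motion sequences for both parties, each (B, T, motion_dim).
        return torch.stack(spk_seq, dim=1), torch.stack(lst_seq, dim=1)
```

A forward pass over audio features of shape (B, T, audio_dim) returns aligned speaker and listener motion sequences of shape (B, T, motion_dim); the within-timestep ordering (speaker first, then listener) is what gives the chain its real-time mutual influence.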
Key insights distilled from: Mingze Sun et al., arxiv.org, 03-29-2024, https://arxiv.org/pdf/2403.19467.pdf