Core Concept
TANGO is a framework that generates realistic, audio-synchronized co-speech gesture videos. It combines a hierarchical audio-motion embedding space for accurate gesture retrieval with a diffusion-based interpolation network that produces seamless transitions between the retrieved video segments.
Liu, H., Yang, X., Akiyama, T., Huang, Y., Li, Q., Kuriyama, S., & Taketomi, T. (2024). TANGO: Co-Speech Gesture Video Reenactment with Hierarchical Audio Motion Embedding and Diffusion Interpolation. arXiv preprint arXiv:2410.04221.
The paper introduces TANGO, a framework designed to generate high-fidelity, audio-synchronized co-speech gesture videos from a short reference video and target speech audio. It addresses two key limitations of existing gesture video generation methods: audio-motion misalignment and visual artifacts.
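The retrieval stage described above can be illustrated with a minimal sketch: given embeddings of speech audio and candidate gesture video segments projected into a shared space, the best-matching segments are found by similarity search. The embedding dimensions, database size, and the use of cosine similarity here are illustrative assumptions, not details taken from the paper.

```python
import numpy as np

def cosine_similarity(query, database):
    # query: (d,) vector; database: (n, d) matrix -> (n,) similarity scores
    query = query / np.linalg.norm(query)
    database = database / np.linalg.norm(database, axis=1, keepdims=True)
    return database @ query

# Hypothetical pre-computed embeddings in a shared audio-motion space.
rng = np.random.default_rng(0)
gesture_db = rng.normal(size=(500, 64))   # 500 candidate gesture video segments
query_audio = rng.normal(size=64)         # embedding of the target speech clip

# Retrieve the top-3 gesture segments most aligned with the speech audio;
# a diffusion-based interpolation network would then smooth the transitions
# between the retrieved segments.
scores = cosine_similarity(query_audio, gesture_db)
top_k = np.argsort(scores)[::-1][:3]
print("retrieved segment indices:", top_k)
```

In the actual system, the embeddings would come from learned audio and motion encoders trained so that matching speech and gestures land close together in the shared space; the random vectors above only stand in for those encoder outputs.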