Core Concepts
Dyadic Interaction Modeling (DIM) is an effective framework for generating realistic listener facial expressions and head motions in two-person conversations.
Abstract
Human-human communication involves complex, bidirectional interactions, so models must capture the dyadic context rather than the speaker alone. The proposed framework, DIM, combines pre-training with contrastive learning to jointly model speaker and listener motions and generate lifelike facial expressions and head movements. By modeling interactions in both directions, it addresses the limitations of existing one-directional methods and increases the diversity of generated motions. Two variants, DIM-Listener and DIM-Speaker, produce realistic listener behaviors from the speaker's audio-visual inputs and realistic speaker facial motions, respectively. Extensive experiments demonstrate superior performance in listener motion generation, establishing a new state of the art.
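The abstract mentions contrastive pre-training over dyadic context without detail. The sketch below illustrates one plausible form of that idea: an InfoNCE-style contrastive loss that pulls together embeddings of matched speaker/listener motion clips and pushes apart mismatched pairs within a batch. The encoder choice (GRU), the class name, and all dimensions are hypothetical illustrations for this summary, not the paper's actual architecture.

```python
# Minimal sketch of dyadic contrastive pre-training (assumptions noted above).
import torch
import torch.nn as nn
import torch.nn.functional as F

class DyadicContrastivePretrainer(nn.Module):
    def __init__(self, motion_dim: int = 56, embed_dim: int = 128,
                 temperature: float = 0.07):
        super().__init__()
        # Hypothetical sequence encoders mapping motion clips of shape
        # (batch, frames, motion_dim) to one embedding per clip.
        self.speaker_encoder = nn.GRU(motion_dim, embed_dim, batch_first=True)
        self.listener_encoder = nn.GRU(motion_dim, embed_dim, batch_first=True)
        self.temperature = temperature

    def forward(self, speaker_motion, listener_motion):
        # Use the final hidden state as each clip's embedding.
        _, s = self.speaker_encoder(speaker_motion)    # (1, B, embed_dim)
        _, l = self.listener_encoder(listener_motion)  # (1, B, embed_dim)
        s = F.normalize(s.squeeze(0), dim=-1)          # (B, embed_dim)
        l = F.normalize(l.squeeze(0), dim=-1)

        # InfoNCE: aligned speaker-listener pairs (the diagonal) are
        # positives; every other pairing in the batch is a negative.
        logits = s @ l.t() / self.temperature          # (B, B)
        targets = torch.arange(s.size(0), device=s.device)
        # Symmetric loss over both matching directions.
        return 0.5 * (F.cross_entropy(logits, targets)
                      + F.cross_entropy(logits.t(), targets))

# Usage: one optimization step on a batch of time-aligned dyadic clips.
model = DyadicContrastivePretrainer()
speaker = torch.randn(8, 64, 56)    # (batch, frames, motion features)
listener = torch.randn(8, 64, 56)
loss = model(speaker, listener)
loss.backward()
```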
Stats
CANDOR dataset consists of 1,656 conversations in English.
ViCo dataset includes 483 video sequences with 50 unique listeners.
LM_Listener dataset contains 2,366 training segments of a single listener (Trevor Noah).
Quotes
"We present an effective framework for creating 3D facial motions in dyadic interactions."
"Our method not only generates listener behaviors from speaker audio-visual inputs but could also adeptly produce speaker facial motions."
"Extensive experiments demonstrate the superiority of our framework in generating listener motions."