
CustomListener: Text-guided Responsive Interaction for User-friendly Listening Head Generation

Core Concepts
The authors propose CustomListener, a user-friendly framework for generating listener head motions guided by text priors. The approach combines dynamic portrait tokens with past-guided motion generation to achieve controllable, interactive responses.
CustomListener addresses the limitations of existing methods by letting users customize listener attributes and produce realistic, controllable responses. The SDP module transforms static portrait tokens into dynamic ones, while the PGG module ensures coherence between segments. Extensive experiments confirm that CustomListener achieves state-of-the-art performance in listener motion generation.
The ViCo dataset contains three parts: a train set Dtrain, a test set Dtest, and an out-of-domain set Dood. Its annotations include emotion labels, activated AUs, and head movements detected by Hopenet. The RealTalk dataset serves as a database for retrieving listener videos. The 45-dim acoustic features extracted from the audio include MFCC, MFCC-Delta, energy, zero-crossing rate, and loudness. Facial videos have a resolution of 256x256 at 30 FPS.
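As a rough illustration of the frame-level acoustic features listed above, the sketch below computes per-frame energy, zero-crossing rate, and a log-energy stand-in for loudness with plain NumPy. The MFCC and MFCC-Delta components that make up the remainder of the 45 dimensions would normally come from an audio library such as librosa; the 16 kHz sample rate and frame/hop sizes here are assumptions, not the paper's settings.

```python
import numpy as np

def frame_signal(x, frame_len=800, hop=533):
    # ~25 ms frames with a hop roughly matching 30 FPS at 16 kHz (assumed rates)
    n = 1 + max(0, (len(x) - frame_len) // hop)
    idx = np.arange(frame_len)[None, :] + hop * np.arange(n)[:, None]
    return x[idx]

def simple_acoustic_features(x):
    frames = frame_signal(x)
    energy = np.mean(frames ** 2, axis=1)                                # short-time energy
    zcr = np.mean(np.abs(np.diff(np.sign(frames), axis=1)) > 0, axis=1)  # zero-crossing rate
    loudness = np.log(energy + 1e-8)                                     # crude log-energy "loudness"
    # MFCC (e.g. 21-dim) and MFCC-Delta frames would be stacked here to reach 45 dims
    return np.stack([energy, zcr, loudness], axis=1)

x = np.random.default_rng(0).standard_normal(16000)  # 1 s of synthetic 16 kHz audio
feats = simple_acoustic_features(x)
print(feats.shape)  # (n_frames, 3)
```

In practice these per-frame vectors would be aligned with the 30 FPS video frames before being fed to the motion generator.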
"The applications of listener agent generation in virtual interaction have promoted many works achieving diverse and fine-grained motion generation." "Users can pre-customize detailed attributes of the listener agent." "Our proposed CustomListener achieves lowest FD on both Dtest and Dood subsets in ViCo."

Key Insights Distilled From

by Xi Liu, Ying ... at 03-04-2024

Deeper Inquiries

How can CustomListener be adapted to generate listening body movements?

CustomListener can be adapted to generate listening body movements by incorporating additional modules or components that focus on capturing and synthesizing full-body gestures and postures. This adaptation would involve extending the existing framework to include mechanisms for understanding and generating non-verbal cues beyond facial expressions, such as hand gestures, body orientation, and overall body language. By integrating pose estimation algorithms, motion capture data, or even 3D skeletal models into the system, CustomListener could analyze text prompts in conjunction with speaker information to produce synchronized and naturalistic full-body responses from the listener.
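A minimal sketch of that idea, under assumptions of my own (the source defines no body-pose pathway, and all dimensions and names here are hypothetical): concatenate a per-frame body-pose vector, e.g. 17 2D keypoints from a pose estimator flattened to 34 dims, with the facial motion token, then project back to the generator's token width with a learned linear map.

```python
import numpy as np

rng = np.random.default_rng(1)

FACE_DIM, POSE_DIM, TOKEN_DIM = 64, 34, 64   # hypothetical sizes

# Learned projection (random weights standing in for trained ones)
W = rng.standard_normal((FACE_DIM + POSE_DIM, TOKEN_DIM)) * 0.02
b = np.zeros(TOKEN_DIM)

def fuse_face_and_pose(face_tokens, pose_feats):
    """Concatenate per-frame facial motion tokens with body-pose
    features and project back to the token width the generator expects."""
    fused = np.concatenate([face_tokens, pose_feats], axis=-1)
    return fused @ W + b

T = 90  # 3 s of frames at 30 FPS
face_tokens = rng.standard_normal((T, FACE_DIM))
pose_feats = rng.standard_normal((T, POSE_DIM))
tokens = fuse_face_and_pose(face_tokens, pose_feats)
print(tokens.shape)  # (90, 64)
```

Because the fused tokens keep the original width, the downstream motion generator would not need architectural changes, only retraining on paired face-and-body data.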

What are the implications of using large language models in text-guided motion synthesis?

The use of large language models in text-guided motion synthesis has several significant implications:
Improved semantic understanding: Large language models enhance the system's ability to comprehend complex textual descriptions provided as input for guiding motion synthesis. These models excel at capturing nuanced meanings, context dependencies, and subtle emotional cues embedded within the text.
Enhanced personalization: Leveraging large language models allows for more personalized, customized responses based on user-defined attributes or preferences. The model can adapt its output motions to specific traits such as identity, personality type, relationship dynamics with the speaker, or emotional states specified in the input text.
Increased realism: With sophisticated language understanding, text-guided motion synthesis can produce listener responses that align closely with human-like behaviors and reactions during interactions.
Fine-grained control: Large language models interpret detailed textual instructions accurately, enabling precise adjustments to response timing, rhythm variations, emotion expression levels, and other subtleties essential for authentic non-verbal communication.
Scalability and generalization: Large language models handle diverse inputs across various contexts while still generating coherent, contextually relevant motion sequences.

How does CustomListener address the limitations of existing methods in generating realistic listener responses?

CustomListener addresses key limitations of existing methods for generating realistic listener responses through several innovative features:
User-friendly framework: CustomListener lets users customize detailed attributes of the listener agent before generating its motions from the given textual prompts.
Dynamic portrait generation (SDP module): The SDP module transforms static portrait tokens into dynamic ones through audio-text responsive interaction, accounting for completion timing and rhythms influenced by speaker semantics.
Past-guided motion generation (PGG module): The PGG module ensures coherence between segments by maintaining consistency of customized behavioral habits across video clips and by facilitating smooth transitions between adjacent segments.
Controllable generation: A diffusion-based structure conditioned on the dynamic portrait tokens and on past-motion guidance from adjacent segments gives users flexible control over how the listener's emotions are expressed.
Realism enhancement: By combining speaker-listener coordination with coherent long-term generation, CustomListener ensures synchronous non-verbal feedback aligned with the speaker's actions, tone, and semantic content.
Overall, CustomListener's comprehensive approach tackles the limited expressiveness, dynamic responsiveness, and coherence issues of previous methods, yielding an advanced framework for producing highly realistic listener responses based on user-specified attributes and textual prompts.
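The conditioning idea described above can be sketched as a toy DDPM-style sampler in which a stand-in denoiser receives the dynamic portrait token and the previous segment's motion as conditions. Everything here is a placeholder of my own, not the paper's actual network, noise schedule, or dimensions; it only illustrates how a motion segment can be denoised from noise while both conditions steer every step.

```python
import numpy as np

rng = np.random.default_rng(2)

T_STEPS, SEG_LEN, MOTION_DIM, TOKEN_DIM = 50, 30, 16, 32  # placeholder sizes
betas = np.linspace(1e-4, 0.02, T_STEPS)                  # linear noise schedule
alphas = 1.0 - betas
alpha_bars = np.cumprod(alphas)

def denoiser(x_t, t, portrait_token, past_motion):
    """Stand-in for the trained noise-prediction network: mixes the noisy
    motion with a vector built from both conditioning signals."""
    cond = np.concatenate([portrait_token, past_motion[-1]])  # condition vector
    return 0.1 * x_t + 0.01 * cond[:MOTION_DIM] * (t / T_STEPS)

def sample_segment(portrait_token, past_motion):
    x = rng.standard_normal((SEG_LEN, MOTION_DIM))  # start from pure noise
    for t in reversed(range(T_STEPS)):
        eps = denoiser(x, t, portrait_token, past_motion)
        # standard DDPM posterior-mean update from the predicted noise
        x = (x - betas[t] / np.sqrt(1 - alpha_bars[t]) * eps) / np.sqrt(alphas[t])
        if t > 0:
            x = x + np.sqrt(betas[t]) * rng.standard_normal(x.shape)
    return x

portrait_token = rng.standard_normal(TOKEN_DIM)           # dynamic portrait token
past_motion = rng.standard_normal((SEG_LEN, MOTION_DIM))  # previous segment's motion
seg = sample_segment(portrait_token, past_motion)
print(seg.shape)  # (30, 16)
```

Feeding the previous segment's motion as a condition at every denoising step is what lets consecutive segments stay behaviorally consistent rather than being sampled independently.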