Combining temporal convolution with GPT-2 improves facial Action Unit (AU) detection accuracy by integrating audio-visual data, enabling a more nuanced understanding of emotional expression.
The author presents REWIND, the first publicly available multimodal dataset for speaking-status segmentation from body movement in real-life mingling scenarios, accompanied by high-quality audio recordings.