Core Concepts
Recognizing speaking status in humans with machine learning models trained on video and wearable sensor data, without access to audio.
Abstract
Introduces the REWIND dataset for speaking status segmentation.
Challenges of obtaining individual voice recordings in mingling scenarios.
Baselines for no-audio speaking status segmentation from video, body acceleration, and body pose tracks.
Importance of high-quality audio recordings for cross-modality studies.
Implications for social signal processing and computational social science.
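To make the no-audio task concrete: speaking status segmentation can be framed as per-window binary classification over a wearable acceleration stream, since speaking tends to co-occur with gesturing. The sketch below is illustrative only, with hypothetical function names and an arbitrary variance threshold; it is not one of the REWIND baselines, which are learned models over video, acceleration, and pose tracks.

```python
import random

# Minimal sketch of no-audio speaking-status detection from chest-worn
# tri-axial acceleration, framed as per-window binary classification.
# The threshold-on-motion-energy rule is a toy stand-in for a trained model.

def window_energy(samples, win=20):
    """Variance of acceleration magnitude per non-overlapping window."""
    energies = []
    for i in range(0, len(samples) - win + 1, win):
        window = samples[i:i + win]
        mags = [(x * x + y * y + z * z) ** 0.5 for x, y, z in window]
        mean = sum(mags) / len(mags)
        energies.append(sum((m - mean) ** 2 for m in mags) / len(mags))
    return energies

def speaking_status(samples, win=20, threshold=0.01):
    """1 = speech-related motion in the window, 0 = still (illustrative)."""
    return [1 if e > threshold else 0 for e in window_energy(samples, win)]

# Toy stream: 20 still samples (gravity only), then 20 "gesturing" samples.
random.seed(0)
still = [(0.0, 0.0, 1.0) for _ in range(20)]
moving = [(random.gauss(0, 1.0), random.gauss(0, 1.0), 1.0) for _ in range(20)]
print(speaking_status(still + moving))
```

A trained baseline would replace the hand-set threshold with a classifier over such windowed features (or raw signals), which is what makes the dataset's audio-derived labels valuable as supervision.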
Dataset Details
"High-quality speaking status signals have been obtained from personal head-mounted and directional microphones in seated meetings."
"Acceleration readings were obtained from wearable devices in a badge-like form factor worn by data subjects on the chest."
"Video recordings include top-down and side-elevated views."
Quotes
"Recognizing speaking in humans is a central task towards understanding social interactions."
"Machine learning models trained on video and wearable sensor data make it possible to recognize speech by detecting its related gestures in an unobtrusive, privacy-preserving way."