The authors present REWIND, the first publicly available multimodal dataset for speaking status segmentation from body movement in real-life mingling scenarios, accompanied by high-quality audio recordings.
The task is recognizing speaking status in humans from multimodal signals, using machine learning models trained on video and wearable sensor data, which enables privacy-preserving segmentation.