Learning How Actions Sound from Narrated Egocentric Videos
We propose a novel self-supervised multimodal embedding approach that learns how a wide range of human actions sound directly from narrated in-the-wild egocentric videos, without relying on curated datasets or predefined action categories.