Learning Spatial Audio-Visual Representations from Egocentric Videos through Binaural Audio Inpainting
A self-supervised method that learns spatial audio-visual representations from egocentric videos through a pretext task: inpainting masked segments of binaural audio using visual and audio cues.
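
To make the pretext task concrete, the following is a minimal sketch of how a binaural segment might be hidden and how a reconstruction could be scored. It is an illustration under stated assumptions, not the authors' implementation: the summary does not say whether masking happens on waveforms or spectrograms, how segments are chosen, or which loss is used. The function names `mask_binaural_segment` and `inpainting_loss`, the zero-fill masking, and the mean squared error are all hypothetical choices made for this example.

```python
import numpy as np


def mask_binaural_segment(audio, start, length):
    """Hide samples [start, start + length) in both channels.

    audio: array of shape (2, num_samples), left and right channels.
    Returns a zero-filled masked copy and a boolean mask over time.
    Zero-filling is an assumption; the source does not specify it.
    """
    if audio.ndim != 2 or audio.shape[0] != 2:
        raise ValueError("expected binaural audio of shape (2, num_samples)")
    end = min(start + length, audio.shape[1])
    mask = np.zeros(audio.shape[1], dtype=bool)
    mask[start:end] = True
    masked = audio.copy()
    masked[:, mask] = 0.0
    return masked, mask


def inpainting_loss(prediction, target, mask):
    """Mean squared error over the masked samples only (assumed loss)."""
    diff = prediction[:, mask] - target[:, mask]
    return float(np.mean(diff ** 2))


# Synthetic stand-in for one second of binaural audio at 16 kHz.
rng = np.random.default_rng(0)
audio = rng.standard_normal((2, 16000))
masked, mask = mask_binaural_segment(audio, start=4000, length=2000)
```

In a full system, a model would take `masked` together with the video frames and predict the hidden samples, and `inpainting_loss` would compare that prediction with the original `audio`. Restricting the loss to the masked region keeps the model from being rewarded for simply copying the audio it can already hear.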