Our proposed framework leverages features from the pre-trained multi-modal models CLIP and CLAP to achieve state-of-the-art performance on audio-visual generalized zero-shot learning benchmarks.
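A minimal sketch of how frozen CLIP (visual) and CLAP (audio) features could be fused and scored against class-label text embeddings for generalized zero-shot classification; the projection head, feature dimensions, and fusion by averaging are illustrative assumptions, not the paper's exact design.

```python
import torch
import torch.nn as nn
import torch.nn.functional as F

class AVGZSLHead(nn.Module):
    """Sketch: project frozen CLIP (visual) and CLAP (audio) features into a
    joint space and score them against class-label text embeddings."""
    def __init__(self, vis_dim=512, aud_dim=512, txt_dim=512, joint_dim=256):
        super().__init__()
        self.vis_proj = nn.Sequential(nn.Linear(vis_dim, joint_dim), nn.ReLU(),
                                      nn.Linear(joint_dim, joint_dim))
        self.aud_proj = nn.Sequential(nn.Linear(aud_dim, joint_dim), nn.ReLU(),
                                      nn.Linear(joint_dim, joint_dim))
        self.txt_proj = nn.Linear(txt_dim, joint_dim)

    def forward(self, vis_feat, aud_feat, class_txt_feat):
        # Fuse the two modalities by averaging their normalized projections.
        v = F.normalize(self.vis_proj(vis_feat), dim=-1)
        a = F.normalize(self.aud_proj(aud_feat), dim=-1)
        av = F.normalize(v + a, dim=-1)
        t = F.normalize(self.txt_proj(class_txt_feat), dim=-1)
        # Cosine-similarity logits over the (seen + unseen) class label embeddings.
        return av @ t.t()

# Dummy pre-extracted features standing in for frozen CLIP/CLAP outputs.
head = AVGZSLHead()
vis = torch.randn(8, 512)       # CLIP image features for 8 clips
aud = torch.randn(8, 512)       # CLAP audio features for the same clips
txt = torch.randn(20, 512)      # text features for 20 class labels
logits = head(vis, aud, txt)    # (8, 20) class scores
```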
The core idea of this paper is to use the text modality as an intermediate feature guide, via tri-modal joint embedding models (e.g., AudioCLIP), to disentangle the semantic correspondence between audio and visual sources in multi-source mixtures and thereby improve visual sound source localization.
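A hedged sketch of the text-as-intermediary idea: audio-text similarity weights which source classes are present in the mixture, and each text embedding is then grounded in the spatial visual features to produce per-source heatmaps. The shapes, temperature, and softmax weighting below are assumptions for illustration, not the paper's exact method.

```python
import torch
import torch.nn.functional as F

def text_guided_localization(audio_emb, text_embs, visual_map):
    """Sketch of text-guided source localization (shapes are assumptions).

    audio_emb : (D,)        embedding of the (possibly multi-source) audio mixture
    text_embs : (C, D)      embeddings of candidate source-class prompts
    visual_map: (D, H, W)   spatial visual features in the same joint space
    Returns a per-class (C, H, W) localization map.
    """
    audio_emb = F.normalize(audio_emb, dim=-1)
    text_embs = F.normalize(text_embs, dim=-1)
    vis = F.normalize(visual_map, dim=0)

    # 1) Audio-text similarity estimates which source classes are present.
    class_weights = torch.softmax(audio_emb @ text_embs.t() / 0.07, dim=-1)  # (C,)

    # 2) Ground each text embedding in the image to get per-class heatmaps.
    heatmaps = torch.einsum('cd,dhw->chw', text_embs, vis)                   # (C, H, W)

    # 3) Re-weight the heatmaps by audio support, separating the
    #    contributions of different sources in the mixture.
    return class_weights[:, None, None] * heatmaps

# Dummy inputs standing in for AudioCLIP-style tri-modal features.
maps = text_guided_localization(torch.randn(512), torch.randn(5, 512),
                                torch.randn(512, 14, 14))
print(maps.shape)  # torch.Size([5, 14, 14])
```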
A single shared vision transformer backbone can effectively process both audio and visual inputs, yielding an efficient and scalable audio-visual pretraining framework that outperforms prior approaches that use separate audio and visual encoders.
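One way to realize a shared backbone is to keep only the patch-embedding stems modality-specific and route both spectrogram patches and image patches through the same transformer trunk; the sketch below assumes this split, along with the specific dimensions and mean pooling, purely for illustration.

```python
import torch
import torch.nn as nn

class SharedAVTransformer(nn.Module):
    """Sketch of one transformer trunk shared by audio and images.
    Only the patch-embedding stems are modality-specific (an assumption of
    this sketch, not necessarily the paper's exact configuration)."""
    def __init__(self, dim=384, depth=6, heads=6):
        super().__init__()
        # Images: 3-channel RGB patches; audio: 1-channel log-mel spectrogram patches.
        self.image_patch = nn.Conv2d(3, dim, kernel_size=16, stride=16)
        self.audio_patch = nn.Conv2d(1, dim, kernel_size=16, stride=16)
        layer = nn.TransformerEncoderLayer(d_model=dim, nhead=heads,
                                           batch_first=True)
        self.trunk = nn.TransformerEncoder(layer, num_layers=depth)

    def forward(self, x, modality):
        stem = self.image_patch if modality == "image" else self.audio_patch
        tokens = stem(x).flatten(2).transpose(1, 2)   # (B, N, dim)
        return self.trunk(tokens).mean(dim=1)         # pooled clip-level embedding

model = SharedAVTransformer()
img = torch.randn(2, 3, 224, 224)      # RGB frames
spec = torch.randn(2, 1, 128, 256)     # log-mel spectrograms
img_emb = model(img, "image")
aud_emb = model(spec, "audio")
print(img_emb.shape, aud_emb.shape)    # both (2, 384)
```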
EquiAV introduces a novel framework leveraging equivariance for audio-visual contrastive learning, outperforming previous methods across various benchmarks.
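To make the equivariance idea concrete, the sketch below adds an equivariance term on top of a standard cross-modal InfoNCE loss: a predictor, conditioned on the augmentation parameters, must map the clean embedding to its augmented counterpart. The predictor head, augmentation encoding, and unweighted loss sum are assumptions for illustration, not EquiAV's exact formulation.

```python
import torch
import torch.nn as nn
import torch.nn.functional as F

def info_nce(a, b, temperature=0.07):
    """Standard cross-modal InfoNCE over a batch of paired embeddings."""
    a, b = F.normalize(a, dim=-1), F.normalize(b, dim=-1)
    logits = a @ b.t() / temperature
    targets = torch.arange(a.size(0))
    return 0.5 * (F.cross_entropy(logits, targets) +
                  F.cross_entropy(logits.t(), targets))

class EquivarianceHead(nn.Module):
    """Sketch of an equivariance objective: predict how an embedding moves
    under an augmentation, given that augmentation's parameters."""
    def __init__(self, dim=256, aug_dim=8):
        super().__init__()
        self.predictor = nn.Sequential(nn.Linear(dim + aug_dim, dim),
                                       nn.ReLU(), nn.Linear(dim, dim))

    def forward(self, emb_clean, emb_aug, aug_params):
        pred = self.predictor(torch.cat([emb_clean, aug_params], dim=-1))
        # Penalize mismatch between the predicted and actual augmented embeddings.
        return 1 - F.cosine_similarity(pred, emb_aug, dim=-1).mean()

# Dummy embeddings standing in for encoder outputs of clean/augmented inputs.
v_clean, v_aug = torch.randn(16, 256), torch.randn(16, 256)
a_clean = torch.randn(16, 256)
aug_params = torch.randn(16, 8)        # e.g. encoded crop / color-jitter settings
head = EquivarianceHead()
loss = info_nce(v_clean, a_clean) + head(v_clean, v_aug, aug_params)
```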