Text-Guided Visual Sound Source Localization in Multi-Source Mixtures
The core message of this paper is to leverage the text modality as an intermediate feature guide: tri-modal joint embedding models (e.g., AudioCLIP) are used to disentangle the semantic audio-visual correspondence of each sounding source in multi-source mixtures, improving visual sound source localization.
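To make the idea concrete, below is a minimal sketch (not the paper's implementation) of how text embeddings in a shared tri-modal space could bridge an entangled audio mixture and visual patches. All encoders are stubbed with random tensors, and the class names, dimensions, presence threshold, and temperature are illustrative assumptions only.

```python
# Hypothetical sketch: text as an intermediate guide in a tri-modal
# (audio / vision / text) joint embedding space, AudioCLIP-style.
import torch
import torch.nn.functional as F

D = 512      # shared embedding dimension (assumption)
H = W = 14   # visual patch grid, e.g., ViT-B/16 on 224x224 (assumption)

class_names = ["dog", "guitar", "car engine"]  # candidate source classes (assumption)
# Stand-ins for the outputs of frozen tri-modal encoders (stubs, not real encoders):
text_emb  = F.normalize(torch.randn(len(class_names), D), dim=-1)  # text encoder
audio_emb = F.normalize(torch.randn(D), dim=-1)                    # mixture audio encoder
patch_emb = F.normalize(torch.randn(H * W, D), dim=-1)             # visual patch encoder

# Step 1: audio-text similarity suggests which sources are present in the mixture.
audio_text_sim = text_emb @ audio_emb               # (num_classes,)
present = audio_text_sim > audio_text_sim.mean()    # crude presence test (assumption)

# Step 2: for each detected class, query the visual patches with its *text*
# embedding rather than the entangled mixture audio, yielding one
# localization map per source instead of one blurred map for the mixture.
for name, t, keep in zip(class_names, text_emb, present):
    if not keep:
        continue
    heatmap = (patch_emb @ t).reshape(H, W)   # text-visual similarity map
    heatmap = torch.sigmoid(heatmap / 0.07)   # temperature scaling (assumption)
    peak = (heatmap == heatmap.max()).nonzero()[0]
    print(f"{name}: peak response at patch {tuple(int(i) for i in peak)}")
```

The key design point the sketch illustrates is the disentanglement step: per-class text embeddings are clean, source-specific queries, whereas the mixture audio embedding entangles all concurrent sources.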