The paper investigates the presence of visual biases in two representative audio-visual source localization (AVSL) benchmarks, VGG-SS and Epic-Sounding-Object. Through extensive observations and experiments, the authors find that in over 90% of the cases in each benchmark, the sounding objects can be accurately localized using only visual information, without the need for audio input.
To further validate this finding, the authors evaluate vision-only models on the AVSL benchmarks. For VGG-SS, they use the large vision-language model MiniGPT-v2, which outperforms all existing audio-visual models on the benchmark. For Epic-Sounding-Object, they employ a hand-object interaction detector (HOID) that also surpasses the performance of dedicated AVSL models.
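This vision-only evaluation protocol is simple to express in code. Below is a minimal Python sketch of it, assuming a hypothetical `query_vision_model` callable that stands in for prompting MiniGPT-v2 (or any vision-only detector) for a sounding-object bounding box; the IoU ≥ 0.5 success criterion mirrors the cIoU-style scoring commonly used on VGG-SS, but the exact metric details are an assumption here, not the paper's released code.

```python
# Minimal sketch of a vision-only evaluation on an AVSL benchmark.
# `query_vision_model` is a hypothetical stand-in for a vision-only
# localizer (e.g., a prompted vision-language model); no audio is used.
from typing import Callable, Iterable, Tuple

import numpy as np

Box = Tuple[int, int, int, int]  # (x1, y1, x2, y2) in pixel coordinates


def box_to_mask(box: Box, height: int, width: int) -> np.ndarray:
    """Rasterize a bounding box into a binary mask."""
    mask = np.zeros((height, width), dtype=bool)
    x1, y1, x2, y2 = box
    mask[max(y1, 0):min(y2, height), max(x1, 0):min(x2, width)] = True
    return mask


def iou(pred: np.ndarray, gt: np.ndarray) -> float:
    """Intersection-over-union between two binary masks."""
    inter = np.logical_and(pred, gt).sum()
    union = np.logical_or(pred, gt).sum()
    return float(inter) / float(union) if union > 0 else 0.0


def evaluate_vision_only(
    samples: Iterable[Tuple[np.ndarray, np.ndarray]],
    query_vision_model: Callable[[np.ndarray], Box],
    iou_threshold: float = 0.5,
) -> float:
    """Fraction of samples where the vision-only prediction hits the
    ground-truth sounding-object region, without any audio input."""
    hits, total = 0, 0
    for frame, gt_mask in samples:
        h, w = gt_mask.shape
        pred_mask = box_to_mask(query_vision_model(frame), h, w)
        if iou(pred_mask, gt_mask) >= iou_threshold:
            hits += 1
        total += 1
    return hits / max(total, 1)
```

A high score from such a vision-only baseline, with the audio stream entirely ignored, is exactly the signal of visual bias the paper reports.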
These results clearly demonstrate the significant visual biases present in the existing AVSL benchmarks, which undermine their ability to effectively evaluate audio-visual learning models. The authors provide qualitative analysis and discuss potential strategies to mitigate these biases, such as filtering out data instances that can be easily solved by vision-only models. The findings suggest the need for further refinement of AVSL benchmarks to better support the development of audio-visual learning systems.
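The filtering strategy mentioned above could look roughly like the following sketch, which reuses the hypothetical `query_vision_model`, `box_to_mask`, and `iou` helpers from the previous snippet; the 0.5 IoU cut-off is an illustrative choice, not a threshold prescribed by the authors.

```python
# Hedged sketch of debiasing a benchmark: keep only the instances that a
# vision-only baseline fails to localize, so the retained subset actually
# requires audio. Assumes the helpers defined in the previous snippet.
def filter_visually_biased(samples, query_vision_model, iou_threshold=0.5):
    """Keep only samples the vision-only baseline fails to localize."""
    kept = []
    for frame, gt_mask in samples:
        h, w = gt_mask.shape
        pred_mask = box_to_mask(query_vision_model(frame), h, w)
        if iou(pred_mask, gt_mask) < iou_threshold:
            kept.append((frame, gt_mask))
    return kept
```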
Key insights drawn from the paper by Liangyu Chen... at arxiv.org, 09-12-2024: https://arxiv.org/pdf/2409.06709.pdf