DiffSal: Joint Audio and Video Learning for Diffusion Saliency Prediction
The paper introduces DiffSal, a novel diffusion-based architecture for generalized audio-visual saliency prediction that conditions the denoising process on the input video and audio. The authors report that the framework outperforms previous state-of-the-art methods across six challenging benchmarks.
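
To make "video and audio as conditions" concrete, the following is a minimal sketch of one training step of a generic conditional diffusion model, not DiffSal's actual implementation. It assumes a standard linear noise schedule and a network that predicts the clean saliency map directly. The names `linear_alpha_bars`, `add_noise`, `toy_denoiser`, and `training_loss` are hypothetical, the features are flat lists of numbers, and `toy_denoiser` is a hand-written placeholder for a learned network.

```python
import math
import random


def linear_alpha_bars(num_steps, beta_start=1e-4, beta_end=0.02):
    """Cumulative products of (1 - beta_t) for a linear beta schedule (assumed)."""
    alpha_bars, running = [], 1.0
    for step in range(num_steps):
        frac = step / max(num_steps - 1, 1)
        beta = beta_start + (beta_end - beta_start) * frac
        running *= 1.0 - beta
        alpha_bars.append(running)
    return alpha_bars


def add_noise(clean_map, alpha_bar, rng):
    """Forward diffusion: x_t = sqrt(alpha_bar) * x_0 + sqrt(1 - alpha_bar) * eps."""
    scale_signal = math.sqrt(alpha_bar)
    scale_noise = math.sqrt(1.0 - alpha_bar)
    return [scale_signal * x + scale_noise * rng.gauss(0.0, 1.0) for x in clean_map]


def toy_denoiser(noisy_map, video_feat, audio_feat, alpha_bar):
    """Hypothetical stand-in for a learned network.

    It leans on the fused audio-visual condition more heavily when the
    input map is noisier (small alpha_bar). A real model would be trained.
    """
    prediction = []
    for x, v, a in zip(noisy_map, video_feat, audio_feat):
        condition = 0.5 * (v + a)  # naive fusion of the two modalities
        prediction.append(alpha_bar * x + (1.0 - alpha_bar) * condition)
    return prediction


def training_loss(clean_map, video_feat, audio_feat, alpha_bars, rng):
    """Noise the ground-truth map at a random step, denoise it, return the MSE."""
    t = rng.randrange(len(alpha_bars))
    noisy = add_noise(clean_map, alpha_bars[t], rng)
    predicted = toy_denoiser(noisy, video_feat, audio_feat, alpha_bars[t])
    return sum((p - x) ** 2 for p, x in zip(predicted, clean_map)) / len(clean_map)


rng = random.Random(0)
alpha_bars = linear_alpha_bars(1000)
clean = [0.0, 0.2, 0.9, 0.1]   # toy ground-truth saliency values
video = [0.1, 0.3, 0.8, 0.0]   # toy video features (assumed shape)
audio = [0.0, 0.1, 1.0, 0.2]   # toy audio features (assumed shape)
loss = training_loss(clean, video, audio, alpha_bars, rng)
```

At inference time, a model of this kind would start from pure noise and repeatedly apply the denoiser under the same video and audio conditions. The paper's actual network, fusion mechanism, and loss may differ from this sketch.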