Jointly modeling the time and frequency dimensions of audio-visual speech separation improves separation quality while reducing computational cost.
RTFS-Net is a time-frequency domain audio-visual speech separation method that outperforms existing models in both efficiency and separation quality.