A novel framework that uses audio-visual speech separation as guidance to learn noise-free audio features, making active speaker detection robust in noisy environments.