Leveraging Audio-Visual Modalities to Enhance Target Speech Extraction: A Speech Chain Approach
The proposed AVSepChain framework draws on the speech chain concept to alleviate modality imbalance in audio-visual target speech extraction. It does so by dynamically shifting the dominant and conditional roles between the audio and visual modalities.
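The role shift can be illustrated with a toy sketch. This is not the AVSepChain implementation: the function names (`fuse`, `speech_chain_pass`), the sigmoid gating rule, the two-stage ordering, and the use of plain float lists as features are all assumptions made for illustration only. It shows one possible way a "dominant" stream could be modulated by a "conditional" stream, and how the roles might then be swapped in a second stage.

```python
import math
from typing import List, Sequence, Tuple


def gate(x: float) -> float:
    """Sigmoid gate in (0, 1); a hypothetical choice of conditioning function."""
    return 1.0 / (1.0 + math.exp(-x))


def fuse(dominant: Sequence[float], conditional: Sequence[float]) -> List[float]:
    """Scale each dominant feature by a gate computed from the conditional feature.

    The dominant stream carries the content; the conditional stream only
    modulates it. This fusion rule is an assumption, not the paper's method.
    """
    if len(dominant) != len(conditional):
        raise ValueError("dominant and conditional features must have equal length")
    return [d * gate(c) for d, c in zip(dominant, conditional)]


def speech_chain_pass(
    audio: Sequence[float], visual: Sequence[float]
) -> Tuple[List[float], List[float]]:
    """Run two stages with swapped roles (the stage order is an assumption)."""
    # Stage 1: audio is dominant, visual is conditional.
    stage1 = fuse(dominant=audio, conditional=visual)
    # Stage 2: roles are swapped; visual is dominant, conditioned on stage 1.
    stage2 = fuse(dominant=visual, conditional=stage1)
    return stage1, stage2


if __name__ == "__main__":
    s1, s2 = speech_chain_pass(audio=[1.0, 2.0], visual=[0.0, 0.0])
    print(s1, s2)
```

Because each modality takes the dominant role in one stage, neither stream is reduced to a purely auxiliary signal throughout the whole pass, which is the intuition behind addressing modality imbalance in this sketch.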