Core Concepts
The proposed AVSepChain framework leverages the concept of the speech chain to alleviate the modality imbalance issue in audio-visual target speech extraction tasks by dynamically shifting the dominant and conditional roles between audio and visual modalities.
Summary
The paper introduces the AVSepChain framework, which aims to extract the target speaker's speech from a mixed audio signal by leveraging the target speaker's lip movements as a guiding condition. The framework consists of two stages: speech perception and speech production.
In the speech perception stage, the AV-Separator extracts the target speech from the mixed audio, using the target speaker's lip movements as the conditional modality; audio acts as the dominant modality in this stage.
In the speech production stage, the roles are reversed, with the visual modality becoming the dominant modality and the audio modality serving as the conditional modality. The AV-Synthesizer generates the residual signal of the target speech based on the lip movements, using the preliminary target speech extracted in the first stage as a condition.
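To make the role reversal concrete, the following is a minimal sketch of how such a two-stage speech-chain pipeline could be wired together. The module interfaces shown here (an AV-Separator and AV-Synthesizer that each take a dominant input plus a condition) are illustrative assumptions, not the authors' implementation.

```python
import torch
import torch.nn as nn

class SpeechChainTSE(nn.Module):
    """Hypothetical two-stage speech-chain pipeline:
    stage 1 (perception) is audio-dominant, stage 2 (production) is visual-dominant."""

    def __init__(self, av_separator: nn.Module, av_synthesizer: nn.Module):
        super().__init__()
        self.av_separator = av_separator      # audio dominant, lips conditional
        self.av_synthesizer = av_synthesizer  # lips dominant, audio conditional

    def forward(self, mixture: torch.Tensor, lips: torch.Tensor) -> torch.Tensor:
        # Speech perception: extract a preliminary estimate of the target speech
        # from the mixture, conditioned on the target speaker's lip movements.
        preliminary = self.av_separator(mixture, condition=lips)

        # Speech production: predict a residual signal from the lip movements,
        # conditioned on the preliminary estimate, then add it back.
        residual = self.av_synthesizer(lips, condition=preliminary)
        return preliminary + residual
```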
To ensure that the generated target speech captures the same semantic information as the lip movements, the authors introduce a contrastive semantic matching loss. This loss aligns the frame-level representations of the generated speech and the lip movements, which are extracted using pre-trained audio and audio-visual models, respectively.
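Frame-level contrastive matching of this kind is commonly implemented in an InfoNCE style. The sketch below is a minimal, hypothetical version that assumes both streams have already been projected to time-aligned embeddings of the same dimension; it is not the paper's exact formulation.

```python
import torch
import torch.nn.functional as F

def contrastive_semantic_matching_loss(speech_emb: torch.Tensor,
                                       lip_emb: torch.Tensor,
                                       temperature: float = 0.1) -> torch.Tensor:
    """InfoNCE-style loss aligning frame-level speech and lip-movement embeddings.

    speech_emb, lip_emb: (T, D) time-aligned frame representations, e.g. taken
    from pre-trained audio and audio-visual encoders (assumption).
    """
    speech_emb = F.normalize(speech_emb, dim=-1)
    lip_emb = F.normalize(lip_emb, dim=-1)

    # Cosine-similarity matrix between every speech frame and every lip frame.
    logits = speech_emb @ lip_emb.t() / temperature  # (T, T)

    # Each speech frame should match the lip frame at the same time index;
    # all other frames serve as negatives. The loss is symmetrized.
    targets = torch.arange(speech_emb.size(0), device=speech_emb.device)
    loss_s2l = F.cross_entropy(logits, targets)
    loss_l2s = F.cross_entropy(logits.t(), targets)
    return 0.5 * (loss_s2l + loss_l2s)
```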
The authors conduct extensive experiments on multiple benchmark datasets for audio-visual target speech extraction, showing that AVSepChain outperforms state-of-the-art methods. The results highlight the effectiveness of the speech chain concept in addressing the modality imbalance issue and improving the quality of the extracted target speech.
Statistics
The paper reports the following key metrics:
Scale-Invariant Signal-to-Noise Ratio Improvement (SI-SNRi) (see the sketch after this list)
Signal-to-Distortion Ratio Improvement (SDRi)
Perceptual Evaluation of Speech Quality (PESQ)
Word Error Rate (WER)
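SI-SNRi measures how much the extracted speech improves over the unprocessed mixture in terms of scale-invariant SNR. Below is a minimal sketch of the standard computation (not taken from the paper's code):

```python
import torch

def si_snr(estimate: torch.Tensor, target: torch.Tensor, eps: float = 1e-8) -> torch.Tensor:
    """Scale-invariant signal-to-noise ratio in dB for 1-D waveforms."""
    estimate = estimate - estimate.mean()
    target = target - target.mean()
    # Project the estimate onto the target to get the scale-invariant reference.
    s_target = (torch.dot(estimate, target) / (torch.dot(target, target) + eps)) * target
    e_noise = estimate - s_target
    return 10 * torch.log10((s_target.pow(2).sum() + eps) / (e_noise.pow(2).sum() + eps))

def si_snri(estimate: torch.Tensor, mixture: torch.Tensor, target: torch.Tensor) -> torch.Tensor:
    """SI-SNR improvement: gain of the extracted speech over the unprocessed mixture."""
    return si_snr(estimate, target) - si_snr(mixture, target)
```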
Quotes
"By leveraging the audio and video modalities as conditional information for each other, we simulate the speech perception and production processes in the speech chain, thereby alleviating the modality imbalance issue in AV-TSE."
"To ensure that the generated speech of the target speaker possesses the same semantic information as its lip movements, we employ a contrastive semantic matching loss."