Key Concepts
The networks learn to denoise and dereverberate the microphone signals to better correlate them and consequently estimate the source position.
Summary
The paper investigates the use of eXplainable Artificial Intelligence (XAI) techniques, specifically Layer-wise Relevance Propagation (LRP), to analyze two end-to-end deep learning models for speech source localization.
Key highlights:
- The authors inspect the relevance associated with the input features of the two models and find that both networks denoise and dereverberate the microphone signals, which yields more accurate statistical correlations between them and, consequently, better source localization.
- The relevance signals indicate that the networks focus more on the temporal information, such as the onset of the speech signals, rather than the intelligible content of the speech.
- The authors estimate the Time Differences of Arrival (TDoAs) via the Generalized Cross Correlation with Phase Transform (GCC-PHAT), once from the microphone signals and once from the relevance signals extracted from the two networks, and show that the TDoA estimation is more accurate with the relevance signals.
- The results suggest that the networks leverage the statistical correlation between the microphone signals to estimate the source position, rather than relying on the speech content.
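The GCC-PHAT step mentioned in the highlights can be sketched as follows. This is a minimal illustration of the standard technique, not the authors' implementation: the function name `gcc_phat_tdoa`, its parameters, and the small regularization constant are assumptions made for the example. The same function could be applied to a pair of microphone signals or to a pair of relevance signals.

```python
import numpy as np

def gcc_phat_tdoa(x1, x2, fs, max_tau=None):
    """Estimate the delay of x2 relative to x1 (in seconds) with GCC-PHAT.

    Hypothetical helper for illustration; fs is the sampling rate in Hz and
    max_tau optionally limits the search to physically plausible delays.
    """
    n = len(x1) + len(x2)                 # zero-pad to avoid circular wrap-around
    X1 = np.fft.rfft(x1, n=n)
    X2 = np.fft.rfft(x2, n=n)
    cross = X2 * np.conj(X1)              # cross-power spectrum
    cross /= np.abs(cross) + 1e-12        # PHAT weighting: keep phase, drop magnitude
    cc = np.fft.irfft(cross, n=n)         # generalized cross-correlation
    max_shift = n // 2
    if max_tau is not None:
        max_shift = max(1, min(int(fs * max_tau), max_shift))
    # Reorder so that lags run from -max_shift to +max_shift.
    cc = np.concatenate((cc[-max_shift:], cc[:max_shift + 1]))
    lag = int(np.argmax(np.abs(cc))) - max_shift
    return lag / fs

if __name__ == "__main__":
    rng = np.random.default_rng(0)
    fs = 16000
    s = rng.standard_normal(1024)
    delayed = np.concatenate((np.zeros(5), s[:-5]))   # s delayed by 5 samples
    print(gcc_phat_tdoa(s, delayed, fs) * fs)         # estimated delay in samples
```

The PHAT weighting discards spectral magnitude and keeps only phase, which is why the method depends on the timing relationship between channels rather than on the speech content itself.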
Statistics
The probability of anomalous TDoA estimates is lower with the relevance signals than with the microphone signals, especially in more challenging environmental conditions (higher reverberation and noise).
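One way to compute such a probability empirically is sketched below. The summary does not state the criterion the authors use to flag an estimate as anomalous, so the absolute-error threshold here is an assumption, and `anomaly_rate` is a hypothetical helper name.

```python
import numpy as np

def anomaly_rate(tdoa_est, tdoa_true, threshold):
    """Fraction of TDoA estimates whose absolute error exceeds `threshold`.

    Assumption: an estimate counts as anomalous when its absolute error is
    larger than a fixed threshold (same units as the TDoAs). The paper's
    actual criterion may differ.
    """
    err = np.abs(np.asarray(tdoa_est, dtype=float) - np.asarray(tdoa_true, dtype=float))
    return float(np.mean(err > threshold))
```

Evaluating this rate once for TDoAs estimated from microphone signals and once for TDoAs estimated from relevance signals, under the same acoustic conditions, would give the kind of comparison described above.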
Quotes
"The relevance signals indicate which part of the input signals are deemed important by the networks to estimate the source location. This means that both models learned from the microphone signals information that enables to better estimate the TDoA, suggesting, as expected, that the networks leverage on statistical correlation between the microphone signals to estimate the source position."