The paper investigates what is learnt by the different components of the LEArnable Front-end (LEAF) model, a general-purpose audio front-end designed for audio event classification. The LEAF model consists of three learnable components (Gabor filterbank, Gaussian low-pass filters, and Per-Channel Energy Normalisation (PCEN)) and one non-learnable component (Energy Estimation).
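For readers unfamiliar with PCEN, the following is a minimal sketch of the commonly published PCEN formulation applied to a single channel: the energy is divided by an exponentially smoothed version of itself, then compressed with an offset and a root. The function name `pcen`, the parameter names, and the default values are illustrative assumptions, not taken from the paper, and the exact variant used inside LEAF may differ.

```python
def pcen(energy, s=0.04, alpha=0.96, delta=2.0, r=0.5, eps=1e-12):
    """Apply PCEN to one channel's energy sequence (non-negative floats).

    s     : smoothing coefficient of the first-order IIR smoother
    alpha : strength of the automatic gain control
    delta : offset added before compression
    r     : compression exponent
    All defaults are illustrative assumptions.
    """
    out = []
    m = energy[0]  # initialise the smoother with the first frame (one common choice)
    for e in energy:
        # Exponential moving average of the energy: M(t) = (1 - s) M(t-1) + s E(t)
        m = (1.0 - s) * m + s * e
        # Gain control followed by root compression
        out.append((e / (eps + m) ** alpha + delta) ** r - delta ** r)
    return out
```

In a learnable setting, `s`, `alpha`, `delta`, and `r` would each be trainable (typically per channel), which is what makes this layer a candidate for adaptation.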
The authors train LEAF on three different speech processing tasks (keyword spotting, emotion recognition, and language identification) and compare the characteristics of the learnable components before and after training. The results show that only the PCEN layer changes significantly during training, while the Gabor filterbank and Gaussian low-pass filters remain largely unchanged from their initial values.
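One simple way to quantify how far a component has moved from its initialisation is the relative L2 change of its parameter vector. The sketch below is an assumption made for illustration. The summary does not state which measure the authors used, and `relative_change` is a hypothetical helper.

```python
import math


def relative_change(before, after):
    """Relative L2 distance between two equally long parameter vectors.

    Hypothetical helper: returns ||after - before|| / ||before||.
    """
    diff = math.sqrt(sum((a - b) ** 2 for a, b in zip(after, before)))
    norm = math.sqrt(sum(b ** 2 for b in before))
    if norm == 0.0:
        # Avoid division by zero when the initial vector is all zeros
        return 0.0 if diff == 0.0 else math.inf
    return diff / norm
```

A value near zero for a component's parameters would correspond to "largely unchanged from the initial values", while a clearly larger value would indicate a component that training actually moved.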
Based on this finding, the authors propose a noise adaptation scheme where only the PCEN layer is adapted using a small amount of noisy data. They compare the performance of this adapted LEAF model to a model trained entirely on noisy data, as well as a model trained on clean data without adaptation. The results demonstrate that adapting the PCEN layer can effectively mitigate the impact of noise on the model's performance, without the need for a large amount of noisy training data.
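The idea of adapting only the PCEN layer can be sketched as a gradient step that touches PCEN parameters and leaves everything else frozen. This is a framework-free illustration under stated assumptions: the parameter names, the `"pcen."` prefix convention, and the `adapt_step` helper are all hypothetical, and the authors' actual training setup is not described in this summary.

```python
def adapt_step(params, grads, lr=0.01, trainable_prefix="pcen."):
    """Return updated parameters after one gradient-descent step.

    Only entries whose name starts with `trainable_prefix` are updated;
    all other parameters are copied unchanged (i.e. frozen).
    `params` and `grads` map parameter names to lists of floats.
    """
    updated = {}
    for name, values in params.items():
        if name.startswith(trainable_prefix):
            # Trainable: plain gradient-descent update
            updated[name] = [v - lr * g for v, g in zip(values, grads[name])]
        else:
            # Frozen: keep the values learnt on clean data
            updated[name] = list(values)
    return updated
```

In a deep learning framework the same effect is usually achieved by passing only the PCEN parameters to the optimiser, so that the small amount of noisy data updates a correspondingly small number of parameters.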
The key insights from this work are:
Source: arxiv.org