The paper presents a paradigm for adapting general large-scale pretrained models (PTMs) to speech emotion recognition. Although PTMs show remarkable performance across many tasks, their performance on specific downstream tasks still leaves room for improvement, and their considerable size makes them difficult to deploy in practical applications.
To address these limitations, the authors propose a pipeline that produces a task-specific PTM that is both compact and effective. Focusing on speech emotion recognition, they propose an improved emotion-specific pretrained encoder called Vesper.
The key components of Vesper are:
Emotion-guided masking strategy: Vesper leverages the energy of the input speech signal to identify the regions to mask during pretraining, enhancing its sensitivity to emotional information.
Hierarchical self-supervision: Vesper supervises the shallow and deep layers of the model separately, improving its ability to capture both acoustic and semantic representations, both of which are crucial for emotion recognition.
Cross-layer self-supervision: Vesper constrains the output representation to be informative and balanced, containing both rich semantic information and sufficient acoustic information.
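The emotion-guided masking idea above can be sketched as follows. This is a minimal illustration, not the paper's exact procedure: the function name, the per-frame energy criterion, and the choice to mask the highest-energy frames are assumptions for illustration.

```python
import numpy as np

def energy_guided_mask(frames: np.ndarray, mask_ratio: float = 0.5) -> np.ndarray:
    """Return a boolean mask over frames, biased toward high-energy regions.

    `frames` has shape (num_frames, frame_len). Illustrative sketch only:
    high-energy frames are assumed more likely to carry emotionally
    salient speech, so they are preferred as masking targets.
    """
    energy = (frames ** 2).sum(axis=1)        # per-frame signal energy
    num_mask = int(mask_ratio * len(frames))  # how many frames to mask
    idx = np.argsort(energy)[-num_mask:]      # indices of highest-energy frames
    mask = np.zeros(len(frames), dtype=bool)
    mask[idx] = True
    return mask
```

In a masked-prediction pretraining loop, the masked frames would be replaced by a learned mask embedding and the model trained to reconstruct their targets.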
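The hierarchical and cross-layer supervision signals can be combined into a single training objective. The sketch below assumes a teacher model (e.g., WavLM) providing shallow "acoustic" and deep "semantic" target representations; the function names, the L1 objective, and the unweighted sum of terms are assumptions, not the paper's exact formulation.

```python
import numpy as np

def l1(a: np.ndarray, b: np.ndarray) -> float:
    """Mean absolute error between two representations."""
    return float(np.abs(a - b).mean())

def vesper_loss(student_shallow, student_deep, student_out,
                teacher_shallow, teacher_deep) -> float:
    # Hierarchical self-supervision: shallow student layers track the
    # teacher's shallow (acoustic) targets, deep layers track the
    # teacher's deep (semantic) targets.
    l_hier = l1(student_shallow, teacher_shallow) + l1(student_deep, teacher_deep)
    # Cross-layer self-supervision: the final output representation is
    # pushed to retain both acoustic and semantic information.
    l_cross = l1(student_out, teacher_shallow) + l1(student_out, teacher_deep)
    return l_hier + l_cross
```

The loss is zero when every student representation matches its targets, and each term isolates which layer group is drifting.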
The authors evaluate Vesper on three widely used emotion recognition datasets: IEMOCAP, MELD, and CREMA-D. The experimental results demonstrate that Vesper with 4 layers outperforms WavLM Base with 12 layers, and the performance of Vesper with 12 layers surpasses that of WavLM Large with 24 layers, while being more computationally efficient.
Key ideas extracted from the source content by Weidong Chen... at arxiv.org, 04-19-2024.
https://arxiv.org/pdf/2307.10757.pdf