Core Concepts
The paper's central claim is that a Wav2Vec2 model pre-trained for automatic speech recognition (ASR) can outperform one pre-trained only with self-supervised learning (SSL) when assessing speech intelligibility and severity in patients with head and neck cancers, even with limited training data.
Summary
This paper explores the use of pre-trained Wav2Vec2 models for the task of assessing speech intelligibility and severity in patients with head and neck cancers. The authors compare the performance of Wav2Vec2 models pre-trained on self-supervised learning (SSL) tasks versus those pre-trained on ASR tasks.
The key highlights are:
- The authors propose training the model on the entire audio file rather than segmenting it, to preserve important context and speech information.
- They find that the Wav2Vec2 model pre-trained on ASR tasks outperforms the SSL-trained model, achieving an average best MSE of 0.73 for intelligibility prediction and 1.15 for severity prediction, using only 95 training samples.
- This ASR-based approach establishes a new baseline on the SpeeCOmco corpus: the best model reduces MSE by 58% to 75% for intelligibility and by 40% to 62% for severity relative to the two previously reported baseline systems.
- Further analysis shows the model's ability to generalize across different speech content and its robustness to varying segment durations.
- The strong correlation between ASR performance and speech quality assessment suggests a deep connection between the two tasks.
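The whole-file idea in the first highlight can be illustrated with a small sketch. It is not the authors' implementation: this summary does not say how frame-level Wav2Vec2 features are aggregated, so mean pooling over time and a linear regression head are assumptions here, and the toy feature values, `mean_pool`, and `predict_score` are hypothetical names and data used only for illustration. The point is that pooling over all frames yields one fixed-size vector per recording, whatever its duration, so no segmentation is needed.

```python
from typing import List, Sequence


def mean_pool(frames: Sequence[Sequence[float]]) -> List[float]:
    """Average frame-level features over time into one fixed-size vector.

    Assumption: mean pooling is one common aggregation choice; the
    summarized paper may use a different one.
    """
    if not frames:
        raise ValueError("need at least one frame")
    dim = len(frames[0])
    return [sum(frame[d] for frame in frames) / len(frames) for d in range(dim)]


def predict_score(embedding: Sequence[float],
                  weights: Sequence[float],
                  bias: float) -> float:
    """Hypothetical linear regression head: pooled embedding -> one score."""
    return sum(w * x for w, x in zip(weights, embedding)) + bias


# Two toy "recordings" of different lengths with 3-dimensional stand-in
# features (a real encoder would emit hundreds of dimensions per frame).
short_file = [[1.0, 0.0, 2.0], [3.0, 2.0, 4.0]]
long_file = [[0.0, 1.0, 1.0], [2.0, 1.0, 3.0], [4.0, 1.0, 5.0], [2.0, 1.0, 3.0]]

# Made-up head parameters, not trained values.
weights, bias = [0.5, 1.0, 0.25], 1.0

for name, frames in [("short", short_file), ("long", long_file)]:
    embedding = mean_pool(frames)
    print(name, len(frames), embedding, predict_score(embedding, weights, bias))
```

Both toy recordings pool to a vector of the same size even though they contain different numbers of frames, which is what lets a single regression head score an entire file.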
Statistics
"The system achieved an average best MSE at 0.73 for the intelligibility prediction task and 1.15 for the severity prediction task."
"The proposed architectures achieved outstanding results without requiring data augmentation."
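For readers unfamiliar with the metric, the sketch below shows how a mean squared error (MSE) is computed and how results from several training runs could be summarized by a mean and a standard deviation. It assumes that "average best MSE" means the best MSE of each run averaged over runs, which this summary does not confirm. All numbers are made up for illustration, and `mse` and `summarize_runs` are hypothetical helper names.

```python
import statistics
from typing import Sequence, Tuple


def mse(predictions: Sequence[float], targets: Sequence[float]) -> float:
    """Mean squared error between predicted and reference scores."""
    if len(predictions) != len(targets) or not targets:
        raise ValueError("inputs must be non-empty and of equal length")
    return sum((p - t) ** 2 for p, t in zip(predictions, targets)) / len(targets)


def summarize_runs(best_mse_per_run: Sequence[float]) -> Tuple[float, float]:
    """Mean and sample standard deviation of the best MSE across runs.

    Assumption: one best MSE is recorded per training run.
    """
    return statistics.mean(best_mse_per_run), statistics.stdev(best_mse_per_run)


# Illustrative scores only, not values from the paper.
targets = [2.0, 4.0, 6.0, 8.0]
predictions = [3.0, 4.0, 5.0, 8.0]
print(mse(predictions, targets))  # 0.5

# Hypothetical best MSE from three runs of the same configuration.
mean_mse, std_mse = summarize_runs([0.5, 1.0, 1.5])
print(mean_mse, std_mse)
```

A lower mean indicates better accuracy on average, while a smaller standard deviation indicates more consistent behaviour across runs.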
Quotes
"Remarkably, most of the systems we proposed consistently outperform these existing baseline systems. With our best model, we achieved between 58% to 75% MSE reduction compared with the two baselines reported above for intelligibility assessment and between 40% to 62% to MSE reduction for severity assessment within the context of SpeeCOmco corpus."
"While comparing feature extractor based on pre-trained SSL with pre-trained ASR, it is surprising that pre-trained ASR extractor outperformed the pre-trained SSL one. Not only with better average MSE, pre-trained ASR extractor also shows a more consistent performance with a significantly smaller standard deviation."
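The percentage MSE reductions quoted above are presumably relative reductions with respect to a baseline system's MSE. A minimal sketch of that arithmetic follows. The formula is the standard relative-improvement definition and is assumed rather than confirmed by this summary. The baseline and model values are hypothetical, not the paper's figures, and `relative_mse_reduction` is an illustrative name.

```python
def relative_mse_reduction(baseline_mse: float, model_mse: float) -> float:
    """Fraction by which the model's MSE is lower than the baseline's MSE."""
    if baseline_mse <= 0:
        raise ValueError("baseline MSE must be positive")
    return (baseline_mse - model_mse) / baseline_mse


# Hypothetical values chosen only to make the arithmetic easy to follow.
baseline = 2.0
model = 0.5
reduction = relative_mse_reduction(baseline, model)
print(f"{reduction:.0%} MSE reduction")  # 75% MSE reduction
```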