Core Concept
Two efficient approaches to incorporate representations from pre-trained self-supervised speech models into end-to-end ASR architectures, resulting in faster training and significant performance gains.
Summary
The paper proposes two approaches to efficiently integrate representations from pre-trained self-supervised speech models (such as Wav2Vec and HuBERT) into end-to-end automatic speech recognition (ASR) architectures:
- Subsampled Framewise Addition (SFA): This simple, parameterless approach adds the subsampled self-supervised representations elementwise to the fbank features. While efficient, it struggles when the lengths of the two sequences differ significantly (see the first sketch after this list).
- Cross Attention (CA): To address the limitations of SFA, the authors introduce a cross-attention layer that allows the model to dynamically learn the alignment between the fbank and self-supervised representations (see the second sketch after this list).
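
Below is a minimal sketch of how SFA could be implemented in PyTorch. The subsampling stride, the assumption that both feature streams already share the same dimension, and the truncation strategy are illustrative assumptions, not the paper's exact configuration.

```python
# Minimal sketch of Subsampled Framewise Addition (SFA), assuming PyTorch.
# The stride and the shared feature dimension are illustrative assumptions.
import torch


def subsampled_framewise_addition(
    fbank_feats: torch.Tensor,  # (batch, T_fbank, d) after conv subsampling
    ssl_feats: torch.Tensor,    # (batch, T_ssl, d) from e.g. HuBERT / Wav2Vec
    stride: int = 2,            # assumed subsampling factor for the SSL frames
) -> torch.Tensor:
    """Parameterless fusion: subsample the SSL frames, then add them to the fbank frames."""
    ssl_sub = ssl_feats[:, ::stride, :]  # subsample along the time axis
    # Truncate both streams to the shorter length so elementwise addition is
    # well defined; this length mismatch is exactly where SFA struggles when
    # the two sequences differ significantly.
    T = min(fbank_feats.size(1), ssl_sub.size(1))
    return fbank_feats[:, :T, :] + ssl_sub[:, :T, :]
```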
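The cross-attention fusion can be sketched as follows, here built on torch.nn.MultiheadAttention; the head count, dimensions, projection layer, and residual connection are assumptions for illustration rather than the paper's exact design.

```python
# Hedged sketch of Cross Attention (CA) fusion between fbank and SSL features.
# Dimensions, head count, the projection layer, and the residual connection
# are illustrative assumptions.
import torch
import torch.nn as nn


class CrossAttentionFusion(nn.Module):
    """Fbank frames (queries) attend over SSL frames (keys/values), letting the
    model learn the alignment between the two sequences dynamically."""

    def __init__(self, d_model: int = 256, ssl_dim: int = 768, num_heads: int = 4):
        super().__init__()
        self.ssl_proj = nn.Linear(ssl_dim, d_model)  # match SSL dim to encoder dim
        self.cross_attn = nn.MultiheadAttention(d_model, num_heads, batch_first=True)

    def forward(self, fbank_feats: torch.Tensor, ssl_feats: torch.Tensor) -> torch.Tensor:
        # fbank_feats: (batch, T_fbank, d_model) -> queries
        # ssl_feats:   (batch, T_ssl, ssl_dim)   -> keys and values
        kv = self.ssl_proj(ssl_feats)
        attended, attn_weights = self.cross_attn(query=fbank_feats, key=kv, value=kv)
        # A residual connection keeps the original fbank information; attn_weights
        # is the kind of alignment map the paper visualises.
        return fbank_feats + attended
```

In a full system, the fused output would then feed the Conformer or Transformer encoder as usual.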
The authors evaluate these approaches on the Librispeech and Tedlium datasets, demonstrating significant reductions in word error rate (WER) compared to baseline Conformer and Transformer models. Key findings include:
- HuBERT representations consistently outperform Wav2Vec representations.
- The cross-attention approach outperforms the subsampled framewise addition method.
- The proposed architectures achieve faster convergence, requiring only 10 epochs to outperform the baseline.
- Even with a significant reduction in the number of encoder layers (up to 80%), the models still perform much better than the baseline.
- Visualization of the attention scores shows that the model has learned meaningful alignments between the fbank and self-supervised representations.
- The approach is effective on unseen datasets, with the HuBERT-LARGE model performing the best on the Tedlium dataset.
Overall, the paper presents efficient and effective methods to incorporate self-supervised speech representations into end-to-end ASR systems, leading to faster training and improved performance.
Statistics
The paper reports the following key metrics:
- Word error rate (WER) on the Librispeech-100 and Tedlium2 datasets
- Number of model parameters
- Training time
Quotes
"While directly incorporating HuBERT or Wav2vec as a frontend/Encoder is bound to perform better, our approach results in a model that is both faster in training and smaller in size at the cost of a slight degradation in performance."
"Even after reducing the number of encoder layers by 80%, we find that the model still performs much better despite having only half as many parameters and training time as the baseline model."