The paper proposes two approaches to efficiently integrate representations from pre-trained self-supervised speech models (such as Wav2Vec and HuBERT) into end-to-end automatic speech recognition (ASR) architectures:
Subsampled Framewise Addition (SFA): This simple, parameterless approach performs elementwise addition of the subsampled self-supervised representations with the fbank features. While efficient, it has limitations in handling cases where the lengths of the two sequences differ significantly.
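The SFA idea can be sketched in a few lines. This is a minimal illustration, not the authors' implementation: it assumes the fbank and self-supervised features share the same feature dimension, and uses nearest-neighbour index selection as the subsampling step.

```python
import numpy as np

def sfa(fbank, ssl):
    """Subsampled Framewise Addition (illustrative sketch).

    fbank: (T, D) filterbank features
    ssl:   (T_ssl, D) self-supervised representations (assumed same dim D)
    Subsamples the SSL sequence onto the fbank timeline, then adds
    elementwise. No learned parameters are involved.
    """
    # Nearest-neighbour mapping of fbank frame indices onto SSL frames.
    idx = np.linspace(0, ssl.shape[0] - 1, fbank.shape[0]).round().astype(int)
    return fbank + ssl[idx]
```

Because the mapping is a fixed index selection, SFA adds no parameters, but a hard alignment like this is exactly what breaks down when the two sequence lengths diverge substantially.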
Cross Attention (CA): To address the limitations of SFA, the authors introduce a cross-attention layer that allows the model to dynamically learn the alignment between the fbank and self-supervised representations.
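A cross-attention fusion of this kind can be sketched as follows. This is a hypothetical single-head version for illustration: the fbank frames act as queries and the self-supervised frames as keys and values, and the attended context is added back to the fbank stream as a residual. The projection matrices `Wq`, `Wk`, `Wv` and the residual connection are assumptions, not details taken from the paper.

```python
import numpy as np

def softmax(x, axis=-1):
    # Numerically stable softmax.
    e = np.exp(x - x.max(axis=axis, keepdims=True))
    return e / e.sum(axis=axis, keepdims=True)

def cross_attention_fuse(fbank, ssl, Wq, Wk, Wv):
    """Single-head cross-attention fusion (illustrative sketch).

    fbank: (T, D) filterbank features (queries)
    ssl:   (T_ssl, D_ssl) self-supervised representations (keys/values)
    Wq: (D, d), Wk: (D_ssl, d), Wv: (D_ssl, D) learned projections.
    Each fbank frame attends over all SSL frames, so the two sequences
    may have different lengths.
    """
    Q = fbank @ Wq          # (T, d)
    K = ssl @ Wk            # (T_ssl, d)
    V = ssl @ Wv            # (T_ssl, D)
    scores = Q @ K.T / np.sqrt(Q.shape[-1])   # (T, T_ssl)
    attn = softmax(scores, axis=-1)           # soft alignment over SSL frames
    return fbank + attn @ V                   # residual fusion, shape (T, D)
```

Unlike SFA's fixed index mapping, the soft alignment here is learned, so mismatched sequence lengths are handled without any explicit subsampling.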
The authors evaluate both approaches on the Librispeech and Tedlium datasets, demonstrating significant reductions in word error rate (WER) compared to baseline conformer and transformer models.
Overall, the paper presents efficient and effective methods to incorporate self-supervised speech representations into end-to-end ASR systems, leading to faster training and improved performance.
Source: key insights extracted from arxiv.org, by Darshan Prab..., 04-22-2024: https://arxiv.org/pdf/2404.12628.pdf