Core Concepts
Different data augmentation techniques, including SpecAugment, Gaussian Noise, and Speed Perturbation, can improve the robustness of self-supervised speech recognition models like HuBERT and wav2vec when evaluated on augmented test sets, but may degrade performance on the original dataset.
Abstract
The paper explores the impact of different speech data augmentation techniques on the performance of self-supervised speech recognition models, specifically HuBERT and wav2vec, on the tasks of Phoneme Recognition (PR) and Automatic Speech Recognition (ASR).
Key highlights:
The authors used the S3PRL toolkit to experiment with three data augmentation techniques: SpecAugment, Gaussian Noise, and Speed Perturbation.
For the PR task, the SpecAugment dataset and Gaussian Noise dataset were used for finetuning, while for the ASR task, the SpecAugment dataset and Speed Perturbation dataset were used.
The results show that SpecAugment slightly improves the performance of HuBERT and wav2vec on the original dataset for both PR and ASR tasks.
Models finetuned on the Gaussian Noise and Speed Perturbation datasets perform better on their respective augmented test sets, indicating improved robustness to those specific distortions.
However, the models finetuned on the Gaussian Noise and Speed Perturbation datasets perform worse than the baseline on the original dataset, suggesting a trade-off between robustness to augmented inputs and accuracy on clean data.
In future work, the authors plan to explore sequentially augmented datasets and to evaluate performance on datasets beyond LibriSpeech.
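To make the three augmentation techniques named above concrete, here is a minimal NumPy sketch. It is not the authors' S3PRL pipeline: the function names, the SNR-based noise scaling, the linear-interpolation resampling, and all parameter values are assumptions chosen for illustration, and the SpecAugment sketch applies only one frequency mask and one time mask (it omits time warping).

```python
import numpy as np

def add_gaussian_noise(wave, snr_db, rng):
    """Add white Gaussian noise at a target signal-to-noise ratio in dB (assumed scheme)."""
    signal_power = np.mean(wave ** 2)
    noise_power = signal_power / (10 ** (snr_db / 10))
    noise = rng.normal(0.0, np.sqrt(noise_power), size=wave.shape)
    return wave + noise

def speed_perturb(wave, factor):
    """Resample by linear interpolation; factor > 1 speeds up and shortens the signal."""
    n_out = int(round(len(wave) / factor))
    old_idx = np.arange(len(wave))
    new_idx = np.linspace(0, len(wave) - 1, n_out)
    return np.interp(new_idx, old_idx, wave)

def spec_augment(spec, max_freq_width, max_time_width, rng):
    """Zero one random frequency band and one random time band of a (freq, time) array."""
    out = spec.copy()
    n_freq, n_time = out.shape
    f = rng.integers(0, max_freq_width + 1)   # mask height in frequency bins
    f0 = rng.integers(0, n_freq - f + 1)      # first masked frequency bin
    out[f0:f0 + f, :] = 0.0
    t = rng.integers(0, max_time_width + 1)   # mask width in frames
    t0 = rng.integers(0, n_time - t + 1)      # first masked frame
    out[:, t0:t0 + t] = 0.0
    return out

rng = np.random.default_rng(0)
wave = np.sin(np.linspace(0, 100, 16000))     # stand-in for one second of 16 kHz audio
noisy = add_gaussian_noise(wave, snr_db=10.0, rng=rng)
faster = speed_perturb(wave, factor=1.1)
masked = spec_augment(np.ones((80, 100)), max_freq_width=10, max_time_width=20, rng=rng)
```

Gaussian noise and speed perturbation act on the waveform, whereas SpecAugment acts on a spectrogram-like feature matrix, which is why the third function takes a two-dimensional array.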
Stats
On the original test set, the phoneme error rate (PER) for the Phoneme Recognition (PR) task is 6.38% for HuBERT-Baseline and 32.53% for wav2vec-Baseline.
On the original test set, the word error rate (WER) for the Automatic Speech Recognition (ASR) task is 6.84% for HuBERT-Baseline and 18.78% for wav2vec-Baseline.
HuBERT-Gaussian-Noise achieved a PER of 13.10% on the Gaussian Noise test set.
wav2vec-Gaussian-Noise achieved a PER of 70.67% on the Gaussian Noise test set.
HuBERT-Speed-Perturbation achieved a WER of 21.63% on the Speed Perturbation test set.
wav2vec-Speed-Perturbation achieved a WER of 34.22% on the Speed Perturbation test set.
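For readers unfamiliar with the metrics, PER and WER are conventionally defined as the token-level edit distance (substitutions, deletions, and insertions) between hypothesis and reference, divided by the number of reference tokens. The sketch below illustrates that standard definition; the function names are hypothetical, and it is not taken from the paper or from S3PRL.

```python
def edit_distance(ref, hyp):
    """Levenshtein distance between two token sequences."""
    prev = list(range(len(hyp) + 1))          # distances from an empty reference
    for i, r in enumerate(ref, start=1):
        curr = [i]                            # deleting the first i reference tokens
        for j, h in enumerate(hyp, start=1):
            cost = 0 if r == h else 1
            curr.append(min(prev[j] + 1,          # deletion
                            curr[j - 1] + 1,      # insertion
                            prev[j - 1] + cost))  # match or substitution
        prev = curr
    return prev[-1]

def error_rate(refs, hyps):
    """Corpus-level error rate in percent: total edits over total reference tokens."""
    total_edits = sum(edit_distance(r, h) for r, h in zip(refs, hyps))
    total_tokens = sum(len(r) for r in refs)
    return 100.0 * total_edits / total_tokens

# WER: tokens are words. For PER, pass phoneme sequences instead.
refs = ["the cat sat on the mat".split()]
hyps = ["the cat sit on mat".split()]
wer = error_rate(refs, hyps)  # 2 edits over 6 reference words
```

Because the denominator counts reference tokens only, an error rate can exceed 100% when the hypothesis contains many insertions.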