Key Concepts
Incorporating a large Hawaiian language model into the Whisper automatic speech recognition foundation model can provide a small but significant improvement in transcription accuracy for Hawaiian audio.
Summary
The authors address the challenge of improving Automatic Speech Recognition (ASR) for the low-resource language Hawaiian by incorporating large amounts of independent Hawaiian text data into the Whisper ASR foundation model.
Key highlights:
The authors evaluate the zero-shot transfer performance of different Whisper model sizes on a manually curated Hawaiian ASR test set. The largest Whisper models (large and large-v2) achieve the best baseline performance with word error rates (WERs) around 22%.
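The baseline comparison above rests on word error rate. As a reminder of what the ~22% figure measures, here is a minimal, self-contained WER implementation (word-level Levenshtein distance divided by reference length); this is the standard definition, not code from the paper.

```python
def wer(reference: str, hypothesis: str) -> float:
    """Word error rate: word-level edit distance / number of reference words."""
    ref, hyp = reference.split(), hypothesis.split()
    # Dynamic-programming table for Levenshtein distance over words.
    d = [[0] * (len(hyp) + 1) for _ in range(len(ref) + 1)]
    for i in range(len(ref) + 1):
        d[i][0] = i  # deleting all i reference words
    for j in range(len(hyp) + 1):
        d[0][j] = j  # inserting all j hypothesis words
    for i in range(1, len(ref) + 1):
        for j in range(1, len(hyp) + 1):
            cost = 0 if ref[i - 1] == hyp[j - 1] else 1
            d[i][j] = min(d[i - 1][j] + 1,      # deletion
                          d[i][j - 1] + 1,      # insertion
                          d[i - 1][j - 1] + cost)  # substitution or match
    return d[len(ref)][len(hyp)] / len(ref)
```

For example, dropping the kahakō in one of three words ("aloha mai kakou" vs. "aloha mai kākou") counts as one substitution, giving a WER of 1/3.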
To leverage the available Hawaiian text data, the authors train an external Hawaiian language model (LM) on ~1.5M words of modern Hawaiian text. They then use this LM to rescore the Whisper outputs.
Rescoring the large-v2 Whisper model with the Hawaiian LM provides a small but significant improvement, reducing the WER to around 20%.
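The rescoring step can be sketched as a log-linear combination of the ASR model's score and the external LM's log-probability for each candidate transcript. The interpolation weight `alpha` and the toy LM below are illustrative assumptions; the paper's exact scheme is not reproduced here.

```python
# Sketch of LM rescoring: pick the candidate transcript maximizing
# asr_score + alpha * lm_logprob. alpha is an assumed, tunable weight.

def rescore(candidates, asr_scores, lm_logprob, alpha=0.5):
    """Return the candidate with the best combined ASR + LM score."""
    return max(zip(candidates, asr_scores),
               key=lambda pair: pair[1] + alpha * lm_logprob(pair[0]))[0]

# Hypothetical example: a Hawaiian LM assigns higher probability to the
# orthographically correct form (with ʻokina and kahakō), pulling it ahead
# even though the ASR model scored it slightly lower.
toy_lm = {"mai hoomauna": -3.0, "mai hoʻomāuna": -0.5}
best = rescore(["mai hoomauna", "mai hoʻomāuna"], [-1.0, -1.2], toy_lm.get)
```

The design point is that the LM is trained independently on text alone (~1.5M words here), so it can correct errors the acoustic model makes without any additional transcribed audio.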
The authors analyze example ASR predictions and identify challenges the model faces, such as accurately capturing Hawaiian phonemes like glottal stops and long vowels that differ from English.
The authors discuss ways to further improve Hawaiian ASR by leveraging additional unlabeled text and audio data through techniques like self-supervised learning and pseudo-labeling.
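Pseudo-labeling of the kind mentioned above typically means transcribing unlabeled audio with the current model and keeping only high-confidence outputs as new training pairs. A minimal confidence-filtering sketch, assuming segment dictionaries shaped like Whisper's decoder output (which exposes an `avg_logprob` field per segment); the threshold value is an assumption, not taken from the paper:

```python
# Hedged sketch of the selection step in pseudo-labeling: keep only
# utterances decoded with average log-probability above a threshold,
# so low-confidence (likely wrong) transcripts don't pollute training.

def select_pseudo_labels(segments, threshold=-0.5):
    """Return (id, text) pairs for segments decoded with high confidence."""
    return [(s["id"], s["text"])
            for s in segments
            if s["avg_logprob"] > threshold]

segments = [
    {"id": 1, "text": "aloha mai kākou", "avg_logprob": -0.2},  # confident
    {"id": 2, "text": "???",             "avg_logprob": -1.4},  # discarded
]
kept = select_pseudo_labels(segments)
```

The filtered pairs would then be mixed into fine-tuning data, letting unlabeled Hawaiian audio contribute without manual transcription.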
Statistics
The training set for the Hawaiian LM consists of 45,769 lines, 1,547,831 words, or 7,573,569 characters.
The validation set for the Hawaiian LM consists of 888 lines, 26,607 words, or 129,487 characters.
The ASR test set consists of 57 audio-text pairs, comprising 1,120 words and a total audio duration of 7 minutes and 35.336 seconds.
Quotes
"Mai ho'om¯
auna i ka 'ai: Language Models Improve Automatic Speech Recognition in Hawaiian"
"Leave nothing to waste" or "Use all the data you have"