Core Concepts
This study provides a detailed comparative analysis of the performance of various Automatic Speech Recognition (ASR) models, including Whisper, on the Fearless Steps APOLLO corpus of historical NASA Apollo mission communications. The key focus is on identifying and understanding subgroup-specific performance variations, with the goal of informing advancements in ASR systems for Earth-to-space communications.
Abstract
The study explores the performance of different ASR models, including Whisper, on the Fearless Steps APOLLO corpus of historical NASA Apollo mission communications. The key objectives are:
- Subgroup Performance Assessment:
  - The authors extract interpretable metadata about the audio recordings, transcripts, and speakers.
  - They identify subgroups of recordings based on combinations of the extracted metadata.
  - The Word Error Rate (WER) is computed for each subgroup, and its difference from the overall population's performance (the "divergence") is analyzed.
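The subgroup divergence computation described above can be sketched as follows. This is a minimal illustration, not the paper's implementation: the record field names, the per-utterance WER averaging, and the function names are all assumptions.

```python
from collections import defaultdict

def wer(reference: str, hypothesis: str) -> float:
    """Word Error Rate via word-level Levenshtein (edit) distance."""
    ref, hyp = reference.split(), hypothesis.split()
    # d[i][j] = edit distance between ref[:i] and hyp[:j]
    d = [[0] * (len(hyp) + 1) for _ in range(len(ref) + 1)]
    for i in range(len(ref) + 1):
        d[i][0] = i
    for j in range(len(hyp) + 1):
        d[0][j] = j
    for i in range(1, len(ref) + 1):
        for j in range(1, len(hyp) + 1):
            sub = d[i - 1][j - 1] + (ref[i - 1] != hyp[j - 1])
            d[i][j] = min(sub, d[i - 1][j] + 1, d[i][j - 1] + 1)
    return d[len(ref)][len(hyp)] / max(len(ref), 1)

def subgroup_divergence(records):
    """records: dicts with 'reference', 'hypothesis', 'subgroup' keys.

    Returns {subgroup: mean subgroup WER - mean overall WER}, so a
    positive value marks a subgroup performing worse than the population.
    """
    scores = [wer(r["reference"], r["hypothesis"]) for r in records]
    overall = sum(scores) / len(scores)
    groups = defaultdict(list)
    for r, s in zip(records, scores):
        groups[r["subgroup"]].append(s)
    return {g: sum(ws) / len(ws) - overall for g, ws in groups.items()}
```

Averaging per-utterance WER (rather than pooling edit counts over all words) is one of several reasonable aggregation choices; the paper's exact aggregation is not specified here.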
- Pre-trained vs. Fine-tuned Models:
  - The impact of fine-tuning on ASR models is investigated by comparing the performance of pre-trained models with their fine-tuned counterparts.
  - The results show that fine-tuning consistently improves global ASR performance and reduces the divergence across subgroups.
- Base vs. Small Models:
  - The performance of the base and small ASR models is compared in a zero-shot setting.
  - The findings indicate that the smaller model can outperform the larger one in certain subgroups, despite the larger model's better overall performance.
- Multilingual vs. English-only Models:
  - The performance disparity between multilingual and English-only ASR models is examined, identifying subgroups for which the multilingual model consistently outperforms the English-only counterpart.
  - While most subgroups benefit from the English-only model, a small number of subgroups perform worse with it, favoring the multilingual model instead.
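A per-subgroup comparison between two model variants can be sketched as follows. The function name and dictionary layout are hypothetical; the paper's actual analysis pipeline is not described here.

```python
def model_gap(wer_en: dict, wer_multi: dict) -> dict:
    """Per-subgroup WER gap: English-only WER minus multilingual WER.

    A positive gap means the multilingual model is better (lower WER)
    on that subgroup; a negative gap favors the English-only model.
    """
    return {g: wer_en[g] - wer_multi[g] for g in wer_en if g in wer_multi}
```

Usage: given per-subgroup WER tables for both variants, `[g for g, v in model_gap(wer_en, wer_multi).items() if v > 0]` lists the subgroups where the multilingual model wins, mirroring the subgroup identification described above.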
The insights gained from this study enhance the understanding of subgroup-specific performance variations in ASR systems, paving the way for advancements in the development and optimization of ASR for Earth-to-space communications.
Stats
The study reports the following key statistics:
- The Word Error Rate (WER) for the most negatively and positively divergent subgroups, compared to the overall test performance, for the medium-en and medium-multilingual models.
- The minimum, maximum, and average divergences, as well as the standard deviation of the divergences, for various models (base, small, medium, large-v3) in both zero-shot and fine-tuned settings.
- The WER performance gap when changing the model size (base vs. small), the pre-training objective (medium-en vs. medium-multi), or the training methodology (zero-shot vs. fine-tuned).
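The divergence summary statistics listed above (minimum, maximum, average, standard deviation per model) can be computed with a short helper. This is a sketch with an assumed input layout, not the paper's tooling.

```python
from statistics import mean, stdev

def divergence_summary(divergences: dict) -> dict:
    """Summarize per-subgroup divergences (subgroup WER minus overall WER).

    divergences: {subgroup_name: divergence_value}. Returns the min, max,
    average, and sample standard deviation across subgroups, matching the
    statistics reported per model and training setting.
    """
    vals = list(divergences.values())
    return {
        "min": min(vals),
        "max": max(vals),
        "avg": mean(vals),
        "std": stdev(vals) if len(vals) > 1 else 0.0,
    }
```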