Core Concepts
DistilWhisper, a novel approach that enhances the performance of smaller Whisper models on low-resource languages while retaining the advantages of their multitask and multilingual capabilities.
Abstract
This work focuses on improving the performance of smaller versions of the Whisper multilingual speech recognition model, which exhibit a significant performance gap relative to the larger variants, especially on low-resource languages.
The key insights and contributions are:
Comprehensive analysis of biases in the Whisper model family, including speaker-related (gender, age) and model-related (resourcefulness, model size) biases. The analysis reveals that model-related biases are amplified by quantization, impacting low-resource languages and smaller models more severely.
Introduction of DistilWhisper, a novel approach that combines two key strategies to bridge the performance gap:
Lightweight modular ASR fine-tuning of whisper-small using language-specific experts
Knowledge distillation from whisper-large-v2 to effectively boost ASR performance while retaining the robustness inherited from the multitask and multilingual pre-training.
Extensive experiments demonstrating that DistilWhisper outperforms standard fine-tuning or LoRA adapters, improving performance for targeted low-resource languages on both in-domain and out-of-domain test sets, while introducing only a negligible parameter overhead at inference.
Analysis of the gating mechanism in DistilWhisper, showing that it effectively learns to route inputs to the appropriate language-specific modules, leading to significant performance gains.
Exploration of the impact of temperature and distillation loss on the knowledge distillation process, providing insights into the trade-offs between model size, performance, and robustness.
Investigation of the scalability of DistilWhisper, showing its effectiveness in handling an increasing number of languages without compromising performance.
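The core mechanisms above (gated language-specific experts, and knowledge distillation with a temperature-softened teacher) can be sketched in a few lines of plain Python. This is a minimal illustrative sketch, not the paper's implementation: the scalar gate, the loss weighting `alpha`, and the exact form of the distillation term are assumptions for illustration.

```python
import math

def softmax(logits, temperature=1.0):
    """Softmax with temperature; a higher temperature yields a softer distribution."""
    scaled = [z / temperature for z in logits]
    m = max(scaled)  # subtract the max for numerical stability
    exps = [math.exp(z - m) for z in scaled]
    total = sum(exps)
    return [e / total for e in exps]

def kl_divergence(p, q):
    """KL(p || q) between two discrete distributions over the same support."""
    return sum(pi * math.log(pi / qi) for pi, qi in zip(p, q) if pi > 0)

def gated_layer(shared_out, expert_out, gate):
    """Route between the frozen shared branch and the language-specific
    expert with a gate in [0, 1] (a learned, per-layer value in the paper)."""
    return [gate * e + (1.0 - gate) * s for s, e in zip(shared_out, expert_out)]

def distillation_loss(student_logits, teacher_logits, ce_loss,
                      alpha=0.5, temperature=2.0):
    """Joint objective (illustrative): (1 - alpha) * CE on the labels plus
    alpha * T^2 * KL between temperature-softened teacher and student."""
    p_teacher = softmax(teacher_logits, temperature)
    p_student = softmax(student_logits, temperature)
    kd = kl_divergence(p_teacher, p_student) * temperature ** 2
    return (1.0 - alpha) * ce_loss + alpha * kd
```

With `gate = 1.0` the layer uses only the language-specific expert; with `gate = 0.0` it falls back to the shared (pre-trained) branch, which is how robustness on non-targeted inputs can be preserved. Raising the temperature spreads the teacher's probability mass over more tokens, exposing more of its "dark knowledge" to the student.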
Stats
"Whisper exhibits certain speaker-related biases, such as gender and age, which are kept unchanged after applying quantization to the model."
"Biases associated with the model itself (model-related bias), including language resourcefulness and architecture size, are amplified by quantization."
"Low-resource languages are the most adversely affected by quantization."
"Smaller models experience more significant performance degradation compared to larger ones when quantized."
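The quantization findings above can be made concrete with a small sketch. The absmax int8 round-trip below is an assumption for illustration (not Whisper's actual quantization scheme); it shows how the worst-case rounding error scales with the dynamic range of a weight vector, e.g. when a single outlier inflates the quantization step for all other values.

```python
def quantize_int8(weights):
    """Symmetric absmax quantization: map floats onto the int8 range [-127, 127]."""
    scale = max(abs(w) for w in weights) / 127.0
    q = [round(w / scale) for w in weights]
    return q, scale

def dequantize(q, scale):
    """Map quantized integers back to floats."""
    return [qi * scale for qi in q]

def max_roundtrip_error(weights):
    """Worst-case absolute error introduced by quantize -> dequantize."""
    q, scale = quantize_int8(weights)
    restored = dequantize(q, scale)
    return max(abs(w - r) for w, r in zip(weights, restored))
```

Because the error is bounded by half the quantization step (`scale / 2`), a tensor whose largest value is 10 incurs roughly 50x the per-weight error of one whose largest value is 0.2, even if most weights are identical.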
Quotes
"Can we enhance the performance of smaller models for languages where they currently perform poorly, even though the best model performs well?"
"This phenomenon is often referred to as the curse of multilinguality."
"Recent research findings have demonstrated an alternative solution to the curse of multilinguality, involving equipping moderately sized models with language-specific (LS) modules."