Core Concepts
DistilWhisper proposes a method to bridge the performance gap in automatic speech recognition for under-represented languages by leveraging language-specific experts and knowledge distillation.
Abstract:
The Whisper model covers 99 languages with commendable overall ASR results.
DistilWhisper bridges ASR performance gap using language-specific experts and knowledge distillation.
Introduction:
Whisper's robustness attributed to multitask training.
Performance gap between whisper-large-v2 and whisper-small on various languages.
DistilWhisper:
Approach involves lightweight ASR fine-tuning and knowledge distillation.
Extends whisper-small with language-specific (LS) feed-forward layers for improved performance.
CLSR (conditional language-specific routing) modules introduced for flexible token-level routing between shared and language-specific layers (see the sketch after this list).
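A minimal PyTorch sketch of these two ideas, under stated assumptions: a CLSR-style gate routes each token between the frozen shared feed-forward layer and a trainable language-specific expert, and training combines cross-entropy with knowledge distillation from a larger teacher such as whisper-large-v2. The names, dimensions, loss weights, and the soft sigmoid gate (standing in for the paper's discrete routing) are illustrative assumptions, not the authors' implementation.

```python
import torch
import torch.nn as nn
import torch.nn.functional as F

class CLSRFeedForward(nn.Module):
    """Illustrative CLSR-style block: each token is mixed between a frozen
    shared feed-forward layer and a lightweight language-specific (LS) expert
    via a per-token gate. Names/shapes are assumptions, not the paper's code."""

    def __init__(self, shared_ffn: nn.Module, d_model: int, d_ff: int):
        super().__init__()
        self.shared_ffn = shared_ffn                # pretrained Whisper FFN, kept frozen
        for p in self.shared_ffn.parameters():
            p.requires_grad = False
        self.ls_ffn = nn.Sequential(                # language-specific expert (trainable)
            nn.Linear(d_model, d_ff), nn.GELU(), nn.Linear(d_ff, d_model)
        )
        self.gate = nn.Linear(d_model, 1)           # per-token routing score

    def forward(self, x: torch.Tensor) -> torch.Tensor:
        # g in (0, 1): how much of the LS expert each token uses
        g = torch.sigmoid(self.gate(x))             # (batch, seq, 1)
        return g * self.ls_ffn(x) + (1.0 - g) * self.shared_ffn(x)


def distillation_loss(student_logits, teacher_logits, labels, alpha=0.5, T=1.0):
    """Sketch of a joint objective: cross-entropy on labels plus KL divergence
    toward the teacher's output distribution (alpha and T are illustrative)."""
    ce = F.cross_entropy(student_logits.transpose(1, 2), labels, ignore_index=-100)
    kl = F.kl_div(
        F.log_softmax(student_logits / T, dim=-1),
        F.softmax(teacher_logits / T, dim=-1),
        reduction="batchmean",
    ) * (T * T)
    return alpha * ce + (1.0 - alpha) * kl
```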
Experimental Setup:
Datasets include CommonVoice 13.0 and FLEURS for evaluation.
Language selection based on the WER gap between whisper-large-v2 and whisper-small (see the sketch after this list).
Models compared include whisper-small, whisper-large-v2, standard fine-tuning, LoRA adapters, CLSR-FT, and DistilWhisper.
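As a rough illustration of WER-gap-based language selection, the hypothetical helper below keeps languages where whisper-small trails whisper-large-v2 by at least a chosen margin. It uses the jiwer library for WER; the function name, data layout, and the 5-point threshold are assumptions, not the paper's selection procedure.

```python
import jiwer  # standard WER implementation; usage here is illustrative

def select_languages(refs_by_lang, small_hyps, large_hyps, min_gap=5.0):
    """Hypothetical helper: keep languages where whisper-small trails
    whisper-large-v2 by at least `min_gap` WER points."""
    selected = []
    for lang, refs in refs_by_lang.items():
        wer_small = 100.0 * jiwer.wer(refs, small_hyps[lang])
        wer_large = 100.0 * jiwer.wer(refs, large_hyps[lang])
        if wer_small - wer_large >= min_gap:
            selected.append(lang)
    return selected
```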
Results:
DistilWhisper outperforms other adaptation approaches in both in-domain and out-of-domain test sets.
Effectiveness demonstrated across different training data sizes.
Stats
The model covers 99 languages with strong ASR results.
DistilWhisper extends whisper-small with LS feed-forward layers to improve performance.
Quotes
"Results demonstrate that our approach is more effective than standard fine-tuning or LoRA adapters."
"Our lightweight ASR fine-tuning approach generalizes better than LoRA."