Core Concepts
This paper describes an industrial-scale, multilingual automatic speech recognition (ASR) system whose diverse training data and robust model architecture yield competitive accuracy and practical advantages over state-of-the-art open-source models.
Abstract
The paper presents the development of an industrial-scale multilingual ASR system by AssemblyAI. Key highlights:
Training Data:
Utilized a diverse dataset comprising 12.5M hours of unsupervised data, 188k hours of supervised data, and 1.6M hours of pseudo-labeled data across four languages (English, Spanish, German, French).
Employed data filtering and quality control measures to ensure high-quality reference transcriptions.
Model Architecture:
Adopted a Conformer encoder pre-trained with BEST-RQ and an RNN-T decoder fine-tuned jointly with the encoder.
The model has roughly 600M parameters, making it significantly smaller than open-source models such as Whisper large-v3 and Canary-1B.
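The BEST-RQ pre-training objective labels masked speech frames with a frozen random-projection quantizer: each frame is projected by a fixed random matrix and assigned the index of its nearest entry in a fixed random codebook, and the encoder is trained to predict those labels. A minimal pure-Python sketch of that quantizer, assuming l2-normalized cosine matching (the dimensions and codebook size here are illustrative, not the paper's):

```python
import random
import math

def make_quantizer(feat_dim, proj_dim, codebook_size, seed=0):
    """Build a frozen random projection matrix and codebook, as in BEST-RQ.
    Both are randomly initialized and never trained."""
    rng = random.Random(seed)
    proj = [[rng.gauss(0, 1) for _ in range(proj_dim)] for _ in range(feat_dim)]
    codebook = [[rng.gauss(0, 1) for _ in range(proj_dim)] for _ in range(codebook_size)]
    return proj, codebook

def quantize(frame, proj, codebook):
    """Map one speech feature frame to a discrete label:
    project it, l2-normalize, then return the nearest codebook index."""
    projected = [sum(f * proj[i][j] for i, f in enumerate(frame))
                 for j in range(len(proj[0]))]
    norm = math.sqrt(sum(x * x for x in projected)) or 1.0
    projected = [x / norm for x in projected]

    def dist(code):
        n = math.sqrt(sum(c * c for c in code)) or 1.0
        return sum((p - c / n) ** 2 for p, c in zip(projected, code))

    return min(range(len(codebook)), key=lambda k: dist(codebook[k]))

# Example: label one (randomly generated) 80-dim log-mel frame.
proj, codebook = make_quantizer(feat_dim=80, proj_dim=16, codebook_size=1024)
rng = random.Random(1)
frame = [rng.gauss(0, 1) for _ in range(80)]
label = quantize(frame, proj, codebook)
```

Because the projection and codebook are frozen, the labels are a cheap, stable training target; only the Conformer encoder learns during pre-training.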
Evaluation:
Achieved word error rates (WERs) competitive with larger, more computationally expensive models across English and non-English test sets.
Demonstrated improved code-switching capability, reduced hallucination rate, and better timestamp accuracy compared to open-source models.
Achieved a 5x inference speedup compared to an optimized Whisper baseline for long-form audio.
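The WER figures above are the word-level edit distance (substitutions + insertions + deletions) divided by the number of reference words. A minimal implementation of that metric:

```python
def wer(reference, hypothesis):
    """Word error rate: word-level edit distance between reference and
    hypothesis, normalized by the reference length."""
    ref, hyp = reference.split(), hypothesis.split()
    # d[i][j] = edit distance between ref[:i] and hyp[:j]
    d = [[0] * (len(hyp) + 1) for _ in range(len(ref) + 1)]
    for i in range(len(ref) + 1):
        d[i][0] = i          # i deletions
    for j in range(len(hyp) + 1):
        d[0][j] = j          # j insertions
    for i in range(1, len(ref) + 1):
        for j in range(1, len(hyp) + 1):
            sub = d[i - 1][j - 1] + (ref[i - 1] != hyp[j - 1])
            d[i][j] = min(sub, d[i - 1][j] + 1, d[i][j - 1] + 1)
    return d[len(ref)][len(hyp)] / max(len(ref), 1)

print(wer("the cat sat on the mat", "the cat sat on mat"))  # 1 deletion over 6 reference words
```

Note that WER can exceed 1.0 when the hypothesis inserts many extra words, which is exactly why hallucination behavior is tracked as a separate metric.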
Insights:
Highlighted the importance of a system-centric approach to analyzing various practical aspects of fully-fledged ASR models, beyond just WER.
Demonstrated the benefits of leveraging large-scale, diverse training data and a robust model architecture for building a reliable, industrial-grade ASR system.
Stats
Our model achieved a 5x inference speedup compared to an optimized Whisper baseline for long-form audio.
Our model showed a 30% reduction in hallucination rate on speech data compared to Whisper.
Our model achieved a 90% reduction in ambient noise fabrication compared to Whisper.
Quotes
"This work served as a foundation for building Universal-1, AssemblyAI's commercial ASR system."
"Our analysis reveals that ASR models trained on multilingual corpora exhibit an ability of handling code-switching even though the training dataset did not contain code-switching samples."