Core Concept
A novel two-stage multi-task learning framework is proposed to build a general-purpose speech and audio encoder that jointly performs automatic speech recognition, audio tagging, and speaker verification.
Summary
The paper presents a two-stage multi-task learning framework to build a general-purpose speech and audio encoder that can perform automatic speech recognition (ASR), audio tagging (AT), and speaker verification (SV) simultaneously.
In the first stage, multi-teacher knowledge distillation (KD) is applied to align the feature spaces of three single-task high-performance teacher encoders (for ASR, AT, and SV) into a single student encoder using unlabelled data. This allows the student encoder to learn a unified feature representation suitable for all three tasks.
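The stage-1 objective can be sketched as a sum of per-teacher feature-alignment losses. The sketch below is a minimal illustration, not the paper's implementation: the dimensions, the linear projection heads, the L1 distance, and the simplification of treating all three teachers' targets as frame-level features are all assumptions.

```python
import numpy as np

# Hypothetical sizes (not from the paper): student hidden dim,
# teacher feature dim, and number of frames in one utterance.
D_STUDENT, D_TEACHER, T = 512, 768, 50

rng = np.random.default_rng(0)

def project(h, W):
    """Linear head mapping student features into one teacher's space."""
    return h @ W

def kd_loss(student_feat, teacher_feat):
    """Mean L1 distance between projected student and teacher features
    (one common KD objective; the paper's exact loss may differ)."""
    return np.abs(student_feat - teacher_feat).mean()

# Student features for one unlabelled utterance, plus frozen targets
# from the three single-task teachers (ASR, AT, SV).
h_student = rng.standard_normal((T, D_STUDENT))
teachers = {name: rng.standard_normal((T, D_TEACHER))
            for name in ("asr", "at", "sv")}
# One projection head per teacher, trained jointly with the student.
proj = {name: rng.standard_normal((D_STUDENT, D_TEACHER)) * 0.01
        for name in teachers}

# Total stage-1 objective: sum of per-teacher alignment losses.
total = sum(kd_loss(project(h_student, proj[n]), teachers[n])
            for n in teachers)
```

In practice the teachers stay frozen, and only the student encoder and the projection heads receive gradients, so the student is pushed toward a representation that all three teachers can be reconstructed from.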
In the second stage, the pre-trained student encoder is fine-tuned with supervised data for each task. Experiments show that this two-stage approach significantly outperforms a baseline model trained with multi-task learning from scratch. The final system achieves performance close to the best-performing single-task encoders on all three tasks, using only 66M total model parameters.
Key highlights:
- Proposed a two-stage multi-task multi-teacher KD training pipeline for ASR, AT, and SV.
- Demonstrated that multi-teacher KD pre-training on unlabelled data is necessary to align the different tasks, and that it leads to better performance on each task.
- Found that ASR and SV should be performed at different encoder depths to achieve a balance between the two tasks.
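The last highlight can be illustrated by tapping task features from different depths of the shared encoder. This is a minimal sketch under stated assumptions: the layer count, tap points, pooling choices, and the `tanh` stand-in for a real encoder block are all illustrative, not taken from the paper.

```python
import numpy as np

rng = np.random.default_rng(0)
N_LAYERS, D = 12, 256
SV_LAYER, ASR_LAYER = 6, 12   # hypothetical tap depths for SV vs. ASR/AT

# Stand-in weights for a stack of encoder blocks.
weights = [rng.standard_normal((D, D)) * 0.05 for _ in range(N_LAYERS)]

def encode(x):
    """Run all blocks, keeping every intermediate representation."""
    states = [x]
    for W in weights:
        x = np.tanh(x @ W)    # placeholder for a transformer/conformer block
        states.append(x)
    return states

x = rng.standard_normal((50, D))          # 50 frames of input features
states = encode(x)

sv_feat = states[SV_LAYER].mean(axis=0)   # utterance-level embedding for SV
asr_feat = states[ASR_LAYER]              # frame-level features for ASR
at_feat = states[ASR_LAYER].mean(axis=0)  # clip-level pooling for AT
```

Reading SV from an intermediate layer while ASR uses the full stack lets the deeper layers specialize for content recognition without erasing the speaker information SV needs.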
Statistics
The paper reports the following key metrics:
- Word error rate (WER) on the LibriSpeech test-clean and test-other sets for ASR.
- Mean average precision (mAP) on the AudioSet evaluation set for audio tagging.
- Equal error rate (EER) on the VoxCeleb1 test set for speaker verification.
Quotes
"The final system achieves good performance on ASR, AT and SV: with less than 4% relative word-error-rate increase on ASR, only 1.9 lower mean averaged precision on AT and 0.23% absolute higher equal error rate on SV compared to the best-performing single-task encoders, using only a 66M total model parameters."