The paper presents a two-stage multi-task learning framework to build a general-purpose speech and audio encoder that can perform automatic speech recognition (ASR), audio tagging (AT), and speaker verification (SV) simultaneously.
In the first stage, multi-teacher knowledge distillation (KD) is applied to align the feature spaces of three single-task high-performance teacher encoders (for ASR, AT, and SV) into a single student encoder using unlabelled data. This allows the student encoder to learn a unified feature representation suitable for all three tasks.
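The first-stage objective can be illustrated with a minimal sketch. It assumes the student feature vector is mapped through one linear projection per teacher and compared to that teacher's output with a mean squared error, and that the per-task terms are combined as a weighted sum. The distance function, the projection heads, the weights, and all names (`project`, `mse`, `multi_teacher_kd_loss`) are hypothetical choices made for illustration; the summary does not state which losses the paper actually uses.

```python
from typing import Dict, List

Vector = List[float]
Matrix = List[List[float]]


def project(weights: Matrix, features: Vector) -> Vector:
    """Linear projection without bias: one output value per row of `weights`."""
    return [sum(w * x for w, x in zip(row, features)) for row in weights]


def mse(a: Vector, b: Vector) -> float:
    """Mean squared error between two vectors of equal length."""
    if len(a) != len(b):
        raise ValueError("vectors must have the same length")
    return sum((x - y) ** 2 for x, y in zip(a, b)) / len(a)


def multi_teacher_kd_loss(
    student_features: Vector,
    projections: Dict[str, Matrix],
    teacher_targets: Dict[str, Vector],
    task_weights: Dict[str, float],
) -> float:
    """Weighted sum of per-teacher distillation losses (assumed form).

    Each teacher gets its own projection head so that a single student
    representation can be matched against teacher feature spaces of
    different dimensionality.
    """
    total = 0.0
    for task, target in teacher_targets.items():
        predicted = project(projections[task], student_features)
        total += task_weights[task] * mse(predicted, target)
    return total


# Toy example: a 2-dimensional student feature and three teachers.
student = [1.0, 2.0]
projections = {
    "asr": [[1.0, 0.0], [0.0, 1.0]],  # identity: matches the ASR teacher exactly
    "at": [[1.0, 1.0]],               # projects to 3.0, the AT teacher says 2.0
    "sv": [[0.5, 0.5]],               # projects to 1.5, matches the SV teacher
}
targets = {"asr": [1.0, 2.0], "at": [2.0], "sv": [1.5]}
weights = {"asr": 1.0, "at": 1.0, "sv": 1.0}

loss = multi_teacher_kd_loss(student, projections, targets, weights)
print(loss)  # 1.0 (only the AT term contributes)
```

No labels appear anywhere in this loss: the teacher outputs are the only targets, which is why the first stage can run on unlabelled data.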
In the second stage, the pre-trained student encoder is fine-tuned with supervised data for each task. Experiments show that this two-stage approach significantly outperforms a baseline model trained with multi-task learning from scratch. The final system achieves performance close to the best-performing single-task encoders on all three tasks, using only 66M total model parameters.
by Xiaoyu Yang,... at arxiv.org 09-26-2024
https://arxiv.org/pdf/2409.17010.pdf