Core Concepts
Generative pre-training with flow matching shows promising results across a variety of downstream speech tasks.
Statistics
Specifically, we pre-trained a generative model, named SpeechFlow, on 60k hours of untranscribed speech with Flow Matching and masked conditions.
SpeechFlow is trained on unlabeled speech with the goal of estimating the underlying distribution of speech conditioned on masked audio.
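The objective described above (conditional flow matching on speech features, with the model conditioned on a partially masked copy of the target audio) can be made concrete with a short sketch. The PyTorch example below is a minimal illustration under stated assumptions: the network SpeechFlowNet, the linear optimal-transport probability path, and the hyperparameters mask_ratio and sigma_min are all hypothetical, not the authors' actual architecture or settings.

```python
# Minimal sketch of masked-condition flow matching pre-training.
# All names and hyperparameters here are illustrative assumptions.
import torch
import torch.nn as nn

class SpeechFlowNet(nn.Module):
    """Hypothetical vector-field estimator: predicts a velocity from a noisy
    feature x_t, the timestep t, and a partially masked copy of the target."""
    def __init__(self, dim: int = 80, hidden: int = 512):
        super().__init__()
        self.net = nn.Sequential(
            nn.Linear(2 * dim + 1, hidden), nn.GELU(),
            nn.Linear(hidden, hidden), nn.GELU(),
            nn.Linear(hidden, dim),
        )

    def forward(self, x_t, t, x_masked):
        # Broadcast t over the frame axis and concatenate all inputs.
        t = t[:, None, None].expand(*x_t.shape[:2], 1)
        return self.net(torch.cat([x_t, x_masked, t], dim=-1))

def flow_matching_step(model, x1, mask_ratio=0.7, sigma_min=1e-4):
    """One conditional flow matching training step on a batch of speech
    features x1 of shape (batch, frames, dim)."""
    b, num_frames, _ = x1.shape
    # Sample noise x0 and a timestep t ~ U(0, 1) per example.
    x0 = torch.randn_like(x1)
    t = torch.rand(b, device=x1.device)
    t_ = t[:, None, None]
    # Linear (optimal-transport) path between noise and data, and its
    # conditional velocity target (Lipman et al.-style flow matching).
    x_t = (1 - (1 - sigma_min) * t_) * x0 + t_ * x1
    target = x1 - (1 - sigma_min) * x0
    # Mask a random subset of frames; the model only sees the unmasked audio.
    mask = torch.rand(b, num_frames, 1, device=x1.device) < mask_ratio
    x_masked = x1.masked_fill(mask, 0.0)
    pred = model(x_t, t, x_masked)
    # Regress the predicted velocity onto the target on masked frames only
    # (whether the loss is restricted this way is an assumption of this sketch).
    return ((pred - target) ** 2)[mask.expand_as(pred)].mean()

# Usage example on dummy data (e.g. 80-dim Mel-spectrogram-like frames).
model = SpeechFlowNet(dim=80)
x1 = torch.randn(4, 200, 80)
loss = flow_matching_step(model, x1)
loss.backward()
```

Restricting the regression loss to masked frames, as done here, pushes the model to infill missing audio from surrounding context; averaging over all frames is an equally plausible reading, so treat that choice as part of the sketch's assumptions.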
For each task, the fine-tuned SpeechFlow is able to match expert models.
SpeechFlow demonstrated strong generalizability, with a clear gap over all other methods on PESQ, CSIG, and COVL.
SpeechFlow provided strong separation results, showing better intelligibility in all cases.
Quotes
"Generative models have gained more and more attention in recent years for their remarkable success in tasks that required estimating and sampling data distribution to generate high-fidelity synthetic data."
"Our work suggested a foundational model for generation tasks in speech can be built with generative pre-training."