The authors describe their systems developed for the Interspeech 2024 Speech Processing Using Discrete Speech Unit Challenge, which covers text-to-speech (TTS), singing voice synthesis (SVS), and automatic speech recognition (ASR) tasks.
For the TTS track, the authors used two types of discrete tokens: semantic tokens from wav2vec 2.0 and acoustic tokens from FunCodec. They developed a modified VQTTS system with a Conformer-based acoustic model and a discrete-unit-based vocoder. Their best submission ranked first on the leaderboard at a low bitrate of 250 bps.
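As a rough illustration of how semantic tokens can be derived from a self-supervised model, the sketch below extracts wav2vec 2.0 hidden features and quantizes them with a k-means codebook. The checkpoint name, layer index, and cluster count are illustrative assumptions, not the authors' configuration, and the final line shows one common bitrate accounting (token rate times bits per token), not the challenge's official metric.

```python
import math

import torch
import torchaudio
from sklearn.cluster import MiniBatchKMeans
from transformers import Wav2Vec2Model

# Illustrative choices only; the paper's checkpoint, layer, and
# cluster count may differ.
CHECKPOINT = "facebook/wav2vec2-base"
LAYER = 6
NUM_CLUSTERS = 64

model = Wav2Vec2Model.from_pretrained(CHECKPOINT)
model.eval()

wav, sr = torchaudio.load("utterance.wav")            # placeholder input file
wav = wav.mean(0, keepdim=True)                       # mix down to mono
wav = torchaudio.functional.resample(wav, sr, 16_000)  # wav2vec 2.0 expects 16 kHz

with torch.no_grad():
    hidden = model(wav, output_hidden_states=True).hidden_states[LAYER]
feats = hidden.squeeze(0).numpy()  # (frames, dim), roughly 50 frames/s

# A k-means codebook is normally trained once on a large corpus and
# reused; fitting on a single utterance here keeps the sketch
# self-contained.
kmeans = MiniBatchKMeans(n_clusters=NUM_CLUSTERS).fit(feats)
tokens = kmeans.predict(feats)  # discrete semantic token sequence

print(tokens[:20], f"~{50 * math.log2(NUM_CLUSTERS):.0f} bps")
```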
In the SVS track, the authors used the Descript Audio Codec (DAC) for discrete tokens and a modified VALL-E system as the SVS pipeline. They found that using only the first codebook layer of DAC tokens struck a good balance between bitrate and perceptual quality.
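A minimal sketch of the first-codebook idea, using the descript-audio-codec package's published interface: encode the waveform with all residual codebooks, then keep only codebook 0. The file name and model variant are placeholders, and this is a sketch of the general technique rather than the authors' exact pipeline.

```python
import dac
from audiotools import AudioSignal

# Pretrained DAC model; the 44 kHz variant is an illustrative choice.
model_path = dac.utils.download(model_type="44khz")
model = dac.DAC.load(model_path)
model.eval()

signal = AudioSignal("singing.wav")  # placeholder input file
signal = signal.to(model.device)

x = model.preprocess(signal.audio_data, signal.sample_rate)
z, codes, latents, _, _ = model.encode(x)

# codes: (batch, n_codebooks, frames). DAC's residual VQ refines the
# signal codebook by codebook, so codebook 0 alone is the coarsest,
# lowest-bitrate description -- the single layer the authors found
# sufficient.
first_layer_tokens = codes[:, 0, :]
print(first_layer_tokens.shape)
```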
For the ASR track, the authors used k-means clusters of WavLM features as discrete tokens and a Zipformer-based neural transducer architecture. Their system achieved a relative character error rate (CER) reduction of up to 13% over the baseline, albeit at a higher bitrate.
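Tokenization here follows the same pattern as the wav2vec 2.0 sketch above, with WavLM features in its place. The arithmetic behind the headline number is the standard relative-reduction formula; the CER values below are hypothetical, chosen only to illustrate the calculation, not figures from the paper.

```python
def relative_cer_reduction(baseline_cer: float, system_cer: float) -> float:
    """Relative character error rate reduction over a baseline."""
    return (baseline_cer - system_cer) / baseline_cer

# Hypothetical CERs for illustration only (not the paper's numbers):
# a drop from 10.0% to 8.7% is a 13% relative reduction.
print(f"{relative_cer_reduction(10.0, 8.7):.1%}")  # 13.0%
```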
The authors conclude that their experimental findings can promote better understanding and wider use of discrete speech tokens in the speech community.
Key insights distilled from arxiv.org, by Yiwei Guo, Ch..., 04-10-2024: https://arxiv.org/pdf/2404.06079.pdf