The authors describe their systems developed for the Interspeech 2024 Speech Processing Using Discrete Speech Unit Challenge, which covers text-to-speech (TTS), singing voice synthesis (SVS), and automatic speech recognition (ASR) tasks.
For the TTS track, the authors used two types of discrete tokens: semantic tokens from wav2vec2.0 and acoustic tokens from FunCodec. They developed a modified VQTTS system with a conformer-based acoustic model and a discrete unit-based vocoder. Their best submission ranked first on the leaderboard at a low bitrate of 250 bps.
In the SVS track, the authors used codes from the Descript Audio Codec (DAC) as the discrete tokens and a modified VALL-E system as the SVS pipeline. They found that using only the first layer of DAC codes as discrete tokens provided a good balance between bitrate and perceptual quality.
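The bitrate effect of keeping only the first codec layer can be illustrated with a common back-of-the-envelope formula: frames per second, times number of layers, times bits per token. The sketch below is only an illustration. The function name `token_bitrate` and the values (50 frames per second, 1024-entry codebooks, 8 layers) are assumptions chosen for round numbers, not DAC's actual configuration, and the challenge's official bitrate definition may differ.

```python
import math

def token_bitrate(frame_rate_hz, codebook_size, num_layers=1):
    """Bits per second of a discrete token stream.

    Assumes every layer emits one token per frame and each token
    costs log2(codebook_size) bits (no entropy coding).
    """
    return frame_rate_hz * num_layers * math.log2(codebook_size)

# Hypothetical codec: 50 frames/s, 1024-entry codebooks, 8 layers.
all_layers = token_bitrate(50, 1024, num_layers=8)
first_layer_only = token_bitrate(50, 1024, num_layers=1)

print(all_layers, first_layer_only)
```

Under these assumed values, dropping from eight layers to one cuts the bitrate by a factor of eight, which is the kind of trade-off against perceptual quality that the summary describes.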
For the ASR track, the authors used k-means cluster indices of WavLM features as the discrete tokens and a Zipformer-based neural transducer architecture. Their system achieved a relative character error rate reduction of up to 13% compared to the baseline, albeit at a higher bitrate.
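Turning continuous features into k-means tokens amounts to replacing each frame with the index of its nearest centroid. The following is a minimal pure-Python sketch of that assignment step, not the authors' pipeline. The helper names `nearest_centroid` and `quantize` are hypothetical, and the 2-dimensional toy vectors stand in for real WavLM features and learned centroids, which are much higher-dimensional.

```python
def nearest_centroid(frame, centroids):
    """Return the index of the centroid closest to `frame` (squared Euclidean distance)."""
    def sq_dist(a, b):
        return sum((x - y) ** 2 for x, y in zip(a, b))
    return min(range(len(centroids)), key=lambda k: sq_dist(frame, centroids[k]))

def quantize(frames, centroids):
    """Map a sequence of feature frames to a sequence of discrete token ids."""
    return [nearest_centroid(frame, centroids) for frame in frames]

# Toy stand-ins (assumed values): four centroids and four feature frames.
centroids = [[0.0, 0.0], [1.0, 0.0], [0.0, 1.0], [1.0, 1.0]]
frames = [[0.1, -0.1], [0.9, 0.2], [0.8, 0.9], [0.1, 0.8]]

tokens = quantize(frames, centroids)
print(tokens)  # [0, 1, 3, 2]
```

In practice the centroids would be learned by running k-means over features extracted from training audio, and the assignment would typically be vectorized rather than written as Python loops.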
The authors conclude that their experimental findings can promote better understanding and utilization of discrete speech tokens in the speech community.