Core Concept
HYPERTTS is a parameter-efficient approach for adapting text-to-speech models to new speakers: it uses a hypernetwork to dynamically generate adapter parameters conditioned on speaker representations, outperforming static adapter-based methods and approaching the performance of full fine-tuning.
Summary
The paper presents HYPERTTS, a novel approach for parameter-efficient adaptation of text-to-speech (TTS) models to new speakers. The key highlights are:
- Dynamic Adapters: HYPERTTS learns speaker-adaptive adapters by conditioning them on speaker embeddings, unlike static adapters used in previous work.
- Parameter Sampling: HYPERTTS employs a learnable hypernetwork to generate the adapter parameters, enabling a continuous parameter space for efficient adaptation to new speakers.
- Parameter Efficiency: HYPERTTS achieves competitive performance compared to full fine-tuning, while using less than 1% of the backbone model parameters, making it highly practical and resource-friendly.
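The mechanism described in the bullets above can be sketched as follows. This is a minimal illustration, not the paper's actual architecture: all dimensions, the two-layer hypernetwork shape, and the bottleneck-adapter form are assumptions made for clarity.

```python
import numpy as np

rng = np.random.default_rng(0)

# Illustrative dimensions (assumed, not from the paper)
d_model = 256      # hidden size of a TTS backbone layer
d_bottleneck = 16  # adapter bottleneck size
d_spk = 64         # speaker embedding size
d_hyper = 128      # hypernetwork hidden size

# A shared hypernetwork maps a speaker embedding to flattened adapter
# weights. In HYPERTTS one hypernetwork is shared across the layers of a
# module; a layer identifier could be appended to the input, omitted here.
n_adapter_params = d_model * d_bottleneck * 2  # down- and up-projection
W1 = rng.normal(0, 0.02, (d_spk, d_hyper))
W2 = rng.normal(0, 0.02, (d_hyper, n_adapter_params))

def generate_adapter(spk_emb):
    """Map a speaker embedding to the two adapter weight matrices."""
    h = np.tanh(spk_emb @ W1)
    flat = h @ W2
    W_down = flat[: d_model * d_bottleneck].reshape(d_model, d_bottleneck)
    W_up = flat[d_model * d_bottleneck:].reshape(d_bottleneck, d_model)
    return W_down, W_up

def adapter_forward(x, spk_emb):
    """Bottleneck adapter with a residual connection; weights are
    generated per speaker rather than stored statically."""
    W_down, W_up = generate_adapter(spk_emb)
    return x + np.maximum(x @ W_down, 0.0) @ W_up  # ReLU bottleneck

spk = rng.normal(size=d_spk)          # one speaker's embedding
x = rng.normal(size=(10, d_model))    # 10 frames of hidden states
y = adapter_forward(x, spk)
```

Because the adapter weights are a continuous function of the speaker embedding, a new speaker only requires a new embedding; no per-speaker adapter weights are trained or stored, which is what keeps the method under 1% of the backbone's parameters.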
The authors conduct extensive experiments on the LibriTTS and VCTK datasets, comparing HYPERTTS to baselines such as full fine-tuning and static adapters. The results demonstrate that HYPERTTS outperforms static adapters and approaches the performance of full fine-tuning while being significantly more parameter-efficient. Subjective evaluations also show that HYPERTTS generates natural-sounding speech comparable to that of the fully fine-tuned model.
The paper also explores the impact of parameter efficiency by varying the size of the hypernetwork, and analyzes the output of the hypernetwork to understand its behavior in generating dynamic adapter parameters.
Statistics
The backbone TTS model pre-trained on the LibriTTS dataset achieves a COS score of 73.794, FFE of 39.19, WER of 0.2035, and MCD of 5.9232 on the VCTK test set in a zero-shot setting (TTS-0).
Full fine-tuning of the backbone model on the VCTK dataset (TTS-FT) improves the COS score to 80.443, reduces FFE to 34.63, WER to 0.2027, and MCD to 5.2387.
HYPERTTS with the hypernetwork in the decoder (HYPERTTSd) achieves a COS score of 77.590, FFE of 38.55, WER of 0.2090, and MCD of 5.9641, using only 0.423% of the backbone parameters.
Increasing the hypernetwork parameter size in HYPERTTSd from 2 to 128 dimensions improves the COS score from 75.89 to 80.26.
Quotes
"HYPERTTS is aimed to make adapters significantly more effective by conditioning them on speaker embeddings, thus enhancing the (effective) learnable parameter space of adapters."
"Notably, to keep the network small in size, we leverage a shared hypernetwork to generate parameters for adapters in every layer of a given module of the TTS backbone."
"Promising results on the dynamic adaptation of adapter parameters using hypernetworks open up new avenues for domain-generic multi-speaker TTS systems."