Core Concepts
The authors introduce the HAM-TTS model, emphasizing its hierarchical acoustic modeling approach and data augmentation strategy for enhancing zero-shot text-to-speech synthesis.
Abstract
The HAM-TTS model addresses key weaknesses of token-based TTS models: pronunciation errors, inconsistent speaking style, and timbre discontinuity. It tackles these through a novel hierarchical acoustic modeling approach, targeted data augmentation, and training data scaled to hundreds of thousands of hours.
The model integrates a latent variable sequence carrying supplementary acoustic information, built on refined self-supervised learning (SSL) discrete units, to mitigate pronunciation errors and style mutations. Data augmentation is employed to improve timbre uniformity in synthesized speech.
Additionally, a few-shot voice conversion model is used to generate voices with identical content but varied timbres, enriching speech diversity in the training data. Comparative experiments demonstrate the model's superiority over VALL-E in pronunciation precision and speaking-style consistency.
The paper also surveys related work and details the experimental setup, results analysis, and directions for future research.
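The hierarchical conditioning idea, injecting an acoustic latent sequence derived from SSL discrete units alongside the text tokens, can be sketched as follows. This is a minimal illustration, not the authors' implementation: the embedding tables, dimensions, length-alignment method, and additive fusion rule are all assumptions made for clarity.

```python
import numpy as np

rng = np.random.default_rng(0)

VOCAB_TEXT, VOCAB_SSL, DIM = 256, 512, 64  # hypothetical sizes

# Hypothetical embedding tables for text tokens and SSL discrete units.
text_emb = rng.normal(size=(VOCAB_TEXT, DIM))
ssl_emb = rng.normal(size=(VOCAB_SSL, DIM))

def fuse_acoustic_latents(text_tokens, ssl_units):
    """Embed both streams, align the SSL sequence to the text length by
    nearest-neighbour resampling, and fuse by addition (assumed rule)."""
    t = text_emb[text_tokens]                # (T_text, DIM)
    s = ssl_emb[ssl_units]                   # (T_ssl, DIM)
    idx = np.linspace(0, len(ssl_units) - 1, num=len(text_tokens))
    s_aligned = s[np.round(idx).astype(int)] # (T_text, DIM)
    return t + s_aligned                     # decoder input

text_tokens = rng.integers(0, VOCAB_TEXT, size=20)
ssl_units = rng.integers(0, VOCAB_SSL, size=50)
decoder_input = fuse_acoustic_latents(text_tokens, ssl_units)
print(decoder_input.shape)  # (20, 64)
```

The point of the sketch is that the acoustic latent stream supplements, rather than replaces, the text tokens, which is how the paper reduces pronunciation errors and style mutations.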
Stats
The data size was scaled up to 650k hours.
The zero-shot TTS model had 0.8B parameters.
The CER for HAM-TTS-S was 4.0%.
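The CER (character error rate) figure above is, by standard convention, the Levenshtein edit distance between the recognized transcript and the reference, divided by the reference length. A minimal implementation for illustration (this is the standard metric definition, not code from the paper):

```python
def cer(reference: str, hypothesis: str) -> float:
    """Character error rate: edit_distance(ref, hyp) / len(ref)."""
    m, n = len(reference), len(hypothesis)
    # prev[j] holds the edit distance between reference[:i-1] and hypothesis[:j]
    prev = list(range(n + 1))
    for i in range(1, m + 1):
        cur = [i] + [0] * n
        for j in range(1, n + 1):
            cost = 0 if reference[i - 1] == hypothesis[j - 1] else 1
            cur[j] = min(prev[j] + 1,         # deletion
                         cur[j - 1] + 1,      # insertion
                         prev[j - 1] + cost)  # substitution
        prev = cur
    return prev[n] / m

print(round(cer("hello world", "helo world"), 4))  # 1 edit / 11 chars -> 0.0909
```

A CER of 4.0% therefore means roughly 4 character-level edits per 100 reference characters after transcribing the synthesized speech with an ASR system.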
Quotes
"Our method incorporates a latent variable sequence containing supplementary acoustic information based on refined self-supervised learning discrete units into the TTS model."
"Comparative experiments demonstrate our model’s superiority over VALL-E in pronunciation precision and maintaining speaking style."
"The substantial requirement for large and diverse training data further limits their widespread adoption."