HAM-TTS: Hierarchical Acoustic Modeling for Token-Based Zero-Shot Text-to-Speech with Model and Data Scaling
Core Concepts
Hierarchical Acoustic Modeling enhances TTS accuracy and consistency.
Abstract
Token-based text-to-speech (TTS) models face challenges in pronunciation accuracy, style consistency, and data diversity. HAM-TTS addresses these with hierarchical acoustic modeling, tailored data augmentation, and training on synthetic data. The model incorporates a latent variable sequence carrying supplementary acoustic information, which improves pronunciation and speaking-style consistency. Timbre uniformity is enhanced by strategically replacing and duplicating data segments during training, and pretrained voice conversion models generate diverse voices to enrich the training data. Comparative experiments show HAM-TTS's superiority over VALL-E in pronunciation precision and style maintenance.
"During training, we strategically replace and duplicate segments of the data to enhance timbre uniformity."
"Our method incorporates a latent variable sequence containing supplementary acoustic information based on refined self-supervised learning (SSL) discrete units into the TTS model by a predictor."
"Our experiments demonstrate the effectiveness of HAM-TTS in improving pronunciation accuracy, speaking style consistency, and timbre continuity in zero-shot scenarios."