HALL-E is a hierarchical LLM-based text-to-speech (TTS) model that reduces the frame rate of a pretrained neural audio codec (NAC) and models the resulting low-frame-rate tokens, enabling the synthesis of high-quality, minute-long speech from text in a single inference step and overcoming the limitations of previous TTS models in handling long-form speech.
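The low-frame-rate idea is easier to picture with a toy example. The sketch below is a hypothetical illustration (the helper name and sizes are assumptions, not the paper's actual requantization scheme): it trades frame rate for quantizer depth by grouping consecutive codec frames, so a language model sees fewer positions per second of speech.

```python
# A minimal sketch (not HALL-E's published implementation) of lowering the
# frame rate of neural-audio-codec tokens by grouping consecutive frames.
import torch

def group_frames(codes: torch.Tensor, factor: int) -> torch.Tensor:
    """Reshape (num_quantizers, T) codec tokens at frame rate R into
    (num_quantizers * factor, T // factor) tokens at frame rate R / factor."""
    q, t = codes.shape
    assert t % factor == 0, "pad the sequence so T is divisible by factor"
    # (q, T) -> (q, T/factor, factor) -> (q*factor, T/factor)
    return codes.reshape(q, t // factor, factor).permute(0, 2, 1).reshape(q * factor, t // factor)

# Example: 8 quantizer streams at 75 Hz become 24 streams at 25 Hz.
codes = torch.randint(0, 1024, (8, 750))      # 10 s of 75 Hz tokens
low_rate = group_frames(codes, factor=3)
print(low_rate.shape)                         # torch.Size([24, 250])
```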
This research introduces "Bahasa Harmony," a comprehensive dataset for Bahasa Indonesian text-to-speech synthesis, and "EnGen-TTS," a novel TTS model based on neural codec language modeling, achieving state-of-the-art performance in speech quality and efficiency.
F5-TTS is a non-autoregressive TTS system that combines flow matching with a Diffusion Transformer and a novel Sway Sampling strategy to achieve fast, fluent, and faithful speech synthesis with strong zero-shot capabilities.
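Sway Sampling amounts to a one-line warp of the inference timesteps. The sketch below follows the form reported for F5-TTS, t = u + s·(cos(πu/2) − 1 + u); treat the exact expression and coefficient range as illustrative rather than authoritative. A negative coefficient s concentrates flow-matching steps near t = 0, where most of the trajectory's work happens.

```python
# A hedged sketch of the Sway Sampling schedule: uniform timesteps u in [0, 1]
# are warped so that more inference steps land early in the flow trajectory.
import numpy as np

def sway_sample(num_steps: int, s: float = -1.0) -> np.ndarray:
    u = np.linspace(0.0, 1.0, num_steps)
    return u + s * (np.cos(np.pi / 2 * u) - 1.0 + u)

print(np.round(sway_sample(8), 3))  # denser near 0 than a uniform grid
```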
MaskGCT is a novel, fully non-autoregressive TTS model that leverages masked generative transformers to synthesize high-quality speech without requiring explicit text-speech alignment or phone-level duration prediction, achieving human-level similarity, naturalness, and intelligibility.
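MaskGCT's decoding follows the masked generative (MaskGIT-style) recipe, which is compact enough to sketch. Everything below is illustrative: `model` is a stand-in for a masked transformer, and the cosine unmasking schedule is the commonly used choice, not necessarily the paper's exact one.

```python
# Parallel iterative decoding: start fully masked, keep the most confident
# predictions each step, and re-mask the rest on a cosine schedule.
import math
import torch

def iterative_decode(model, length, mask_id, steps=10):
    tokens = torch.full((1, length), mask_id)
    for i in range(steps):
        logits = model(tokens)                                 # (1, length, vocab)
        conf, pred = logits.softmax(-1).max(-1)
        pred = torch.where(tokens == mask_id, pred, tokens)    # never flip decided tokens
        conf = conf.masked_fill(tokens != mask_id, float("inf"))
        keep = length - int(length * math.cos(math.pi / 2 * (i + 1) / steps))
        idx = conf.topk(keep, dim=-1).indices                  # most confident so far
        tokens = tokens.scatter(1, idx, pred.gather(1, idx))
    return tokens

dummy = lambda t: torch.randn(t.size(0), t.size(1), 1024)      # stand-in model
print(iterative_decode(dummy, length=50, mask_id=1024))
```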
DMDSpeech is a novel text-to-speech model that leverages distilled diffusion and direct metric optimization to achieve state-of-the-art performance in zero-shot speech synthesis, surpassing even ground truth audio in speaker similarity while significantly reducing inference time.
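The "direct metric optimization" idea is that a few-step distilled generator stays differentiable end to end, so differentiable proxies for evaluation metrics can be trained on directly. The sketch below is a speculative illustration with a speaker-similarity term only; `student`, `speaker_encoder`, and the loss weight are toy stand-ins, not DMDSpeech components.

```python
# A hedged sketch of backpropagating a metric proxy through a distilled
# generator: maximize cosine similarity between speaker embeddings of the
# generated audio and the reference prompt.
import torch
import torch.nn.functional as F

def metric_loss(student, speaker_encoder, text, ref_audio, lambda_spk=1.0):
    gen_audio = student(text, ref_audio)          # differentiable synthesis
    sim = F.cosine_similarity(
        speaker_encoder(gen_audio), speaker_encoder(ref_audio), dim=-1
    )
    return lambda_spk * (1.0 - sim).mean()

# Toy stand-ins so the sketch runs end to end.
student = lambda text, ref: torch.tanh(torch.randn(2, 16000, requires_grad=True))
speaker_encoder = lambda wav: wav.reshape(2, -1, 100).mean(1)
loss = metric_loss(student, speaker_encoder, None, torch.randn(2, 16000))
loss.backward()
```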
Quantizing inherently continuous modalities like audio for text-to-speech synthesis may be suboptimal, and continuous representation learning using per-token latent diffusion models like SALAD offers a competitive alternative with superior intelligibility.
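Per-token latent diffusion replaces the usual softmax head: each LM hidden state conditions a small denoiser that generates a continuous latent for that token. The sketch below is a generic version of that pattern, not SALAD's implementation; all names and sizes are assumptions, and only the noise-prediction network is shown, not the full sampling loop.

```python
# A minimal per-token diffusion head: given a noisy latent, a conditioning
# vector from the LM, and a timestep, predict the added noise.
import torch
import torch.nn as nn

class DiffusionHead(nn.Module):
    def __init__(self, latent_dim=64, cond_dim=512, hidden=256):
        super().__init__()
        self.net = nn.Sequential(
            nn.Linear(latent_dim + cond_dim + 1, hidden), nn.SiLU(),
            nn.Linear(hidden, latent_dim),
        )

    def forward(self, x_noisy, cond, t):  # predict the noise added at step t
        t = t.expand(x_noisy.size(0), 1)
        return self.net(torch.cat([x_noisy, cond, t], dim=-1))

head = DiffusionHead()
x = torch.randn(8, 64)                    # noisy per-token latents
cond = torch.randn(8, 512)                # one LM hidden state per token
eps_hat = head(x, cond, torch.tensor([[0.5]]))
print(eps_hat.shape)                      # torch.Size([8, 64])
```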
Continuous speech tokenization enhances text-to-speech synthesis by preserving more audio information than traditional discrete methods, leading to improved speech continuity, quality, and robustness to sampling rate variations.
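A toy computation makes the information argument concrete: vector quantization snaps each encoder frame to the nearest codebook entry and discards the residual, whereas a continuous token keeps the full vector. The snippet below (random data, illustrative only) measures exactly that discarded residual.

```python
# Quantization error of a discrete tokenizer vs. a lossless continuous token.
import torch

frames = torch.randn(100, 64)             # encoder outputs, one per frame
codebook = torch.randn(1024, 64)          # discrete tokenizer's codebook
ids = torch.cdist(frames, codebook).argmin(dim=-1)
quantized = codebook[ids]                 # discrete path
err = (frames - quantized).pow(2).mean()  # information lost to quantization
print(f"quantization MSE: {err:.3f} (continuous tokens: 0 by construction)")
```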
ControlSpeech is a TTS system that leverages a decoupled codec and a novel Style Mixture Semantic Density (SMSD) module to achieve simultaneous zero-shot speaker cloning and flexible style control, addressing the inability of previous models to independently manipulate content, timbre, and style.
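The SMSD module's name suggests a mixture-density treatment of style. The sketch below is a speculative illustration of that general idea, not ControlSpeech's implementation: a style-description embedding parameterizes a Gaussian mixture over style space, so sampling yields diverse but description-consistent styles, independent of content and timbre. All class names and sizes are assumptions.

```python
# A mixture-density head over style embeddings: predict mixture weights,
# means, and scales from a style-text encoding, then draw one sample.
import torch
import torch.nn as nn

class StyleMixtureHead(nn.Module):
    def __init__(self, text_dim=512, style_dim=128, n_mix=4):
        super().__init__()
        self.n_mix, self.style_dim = n_mix, style_dim
        self.proj = nn.Linear(text_dim, n_mix * (1 + 2 * style_dim))

    def forward(self, style_text_emb):    # (B, text_dim) -> one style sample
        p = self.proj(style_text_emb).view(-1, self.n_mix, 1 + 2 * self.style_dim)
        logit = p[..., 0]
        mu = p[..., 1:1 + self.style_dim]
        log_sigma = p[..., 1 + self.style_dim:]
        k = torch.distributions.Categorical(logits=logit).sample()  # pick a component
        b = torch.arange(mu.size(0))
        return mu[b, k] + log_sigma[b, k].exp() * torch.randn_like(mu[b, k])

head = StyleMixtureHead()
style = head(torch.randn(2, 512))
print(style.shape)                        # torch.Size([2, 128])
```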