This paper presents a novel VQTTS-based system for multi-speaker, multilingual text-to-speech synthesis that achieves high fidelity and naturalness, especially in low-resource scenarios, by combining VQ acoustic features, speaker and language embeddings, and a cross-lingual synthesis approach.
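To make the conditioning mechanism concrete, here is a minimal sketch of utterance-level speaker and language embeddings added to a phoneme encoder; all module names and dimensions are illustrative assumptions, not the paper's actual architecture.

```python
# Hypothetical sketch: conditioning a TTS encoder on speaker/language IDs.
# Dimensions and module names are illustrative, not from the paper.
import torch
import torch.nn as nn

class ConditionedEncoder(nn.Module):
    def __init__(self, n_phones=100, n_speakers=10, n_langs=4, d=256):
        super().__init__()
        self.phone_emb = nn.Embedding(n_phones, d)
        self.speaker_emb = nn.Embedding(n_speakers, d)  # one vector per speaker
        self.lang_emb = nn.Embedding(n_langs, d)        # one vector per language
        self.encoder = nn.GRU(d, d, batch_first=True)

    def forward(self, phones, speaker_id, lang_id):
        x = self.phone_emb(phones)                                # (B, T, d)
        # Broadcast utterance-level conditioning over every frame.
        cond = self.speaker_emb(speaker_id) + self.lang_emb(lang_id)  # (B, d)
        out, _ = self.encoder(x + cond.unsqueeze(1))
        return out

enc = ConditionedEncoder()
phones = torch.randint(0, 100, (2, 15))
h = enc(phones, torch.tensor([0, 3]), torch.tensor([1, 2]))
print(h.shape)  # torch.Size([2, 15, 256])
```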
Fish-Speech is a novel TTS framework that leverages LLMs and a dual autoregressive architecture to overcome the limitations of traditional G2P-based systems, achieving high-quality multilingual speech synthesis with strong voice-cloning and real-time performance.
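The dual-AR idea, as described, stacks a slow autoregressive model over acoustic frames with a fast autoregressive model over the codebooks within each frame. A toy sketch under that assumption (this is not Fish-Speech's actual code; `DualAR` and all shapes are hypothetical):

```python
# Toy dual-autoregressive decoder: a "slow" model steps over frames,
# a "fast" model steps over the codebooks within one frame.
import torch
import torch.nn as nn

V, C, D = 1024, 4, 256   # codebook size, codebooks per frame, hidden dim

class DualAR(nn.Module):
    def __init__(self):
        super().__init__()
        self.slow = nn.GRUCell(D, D)       # frame-level state
        self.fast = nn.GRUCell(D, D)       # intra-frame, codebook-level state
        self.tok_emb = nn.Embedding(V, D)
        self.head = nn.Linear(D, V)

    @torch.no_grad()
    def generate(self, text_ctx, n_frames=5):
        h_slow = text_ctx                  # (B, D) summary of the text prompt
        frames = []
        for _ in range(n_frames):
            h_slow = self.slow(torch.zeros_like(h_slow), h_slow)
            h_fast, prev = h_slow, torch.zeros_like(h_slow)
            codes = []
            for _ in range(C):             # fast loop: one token per codebook
                h_fast = self.fast(prev, h_fast)
                tok = self.head(h_fast).argmax(-1)
                prev = self.tok_emb(tok)
                codes.append(tok)
            frames.append(torch.stack(codes, -1))
        return torch.stack(frames, 1)      # (B, n_frames, C)

model = DualAR()
print(model.generate(torch.randn(2, D)).shape)  # torch.Size([2, 5, 4])
```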
LINA-SPEECH, a novel text-to-speech synthesis model, uses Gated Linear Attention and Initial-State Tuning to match or outperform larger, more complex models while remaining parameter-efficient.
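Gated linear attention replaces softmax attention with a recurrent state updated as S_t = g_t ⊙ S_{t-1} + k_t v_tᵀ, and initial-state tuning adapts only the starting state S_0 per voice. A schematic sketch of that recurrence (not LINA-SPEECH's layer; `GLACell` is a hypothetical name):

```python
# Minimal gated-linear-attention recurrence with a tunable initial state.
import torch
import torch.nn as nn

class GLACell(nn.Module):
    def __init__(self, d=64):
        super().__init__()
        self.q, self.k, self.v = nn.Linear(d, d), nn.Linear(d, d), nn.Linear(d, d)
        self.gate = nn.Linear(d, d)
        # "Initial-state tuning": only this per-voice matrix is trained when
        # adapting to a new speaker; everything else stays frozen.
        self.S0 = nn.Parameter(torch.zeros(d, d))

    def forward(self, x):                       # x: (B, T, d)
        B, T, d = x.shape
        S = self.S0.expand(B, d, d).clone()
        outs = []
        for t in range(T):
            q, k, v = self.q(x[:, t]), self.k(x[:, t]), self.v(x[:, t])
            g = torch.sigmoid(self.gate(x[:, t]))              # per-key decay gate
            S = g.unsqueeze(-1) * S + k.unsqueeze(-1) * v.unsqueeze(1)
            outs.append(torch.bmm(q.unsqueeze(1), S).squeeze(1))
        return torch.stack(outs, 1)

layer = GLACell()
print(layer(torch.randn(2, 10, 64)).shape)  # torch.Size([2, 10, 64])
```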
Very Attentive Tacotron (VAT), a novel autoregressive Transformer-based TTS model, overcomes the robustness and length-generalization limitations of previous models by incorporating an alignment mechanism with interpolated relative position biases (IRPBs), synthesizing high-quality speech for utterances of virtually any length without word omissions or repetitions.
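The alignment idea can be pictured as attention logits biased by a learned table indexed at the offset between each input position and a running alignment position, with the bias linearly interpolated at fractional offsets. A schematic sketch, not VAT's formulation; all names and shapes are assumptions:

```python
# Location-relative attention with interpolated position biases: logits over
# input positions j are biased by a table indexed at (j - p_t), where p_t is
# a running alignment position that advances by a predicted, non-negative delta.
import torch
import torch.nn as nn
import torch.nn.functional as F

class LocationRelativeAttention(nn.Module):
    def __init__(self, d=128, max_offset=32):
        super().__init__()
        self.max_offset = max_offset
        self.bias_table = nn.Parameter(torch.zeros(2 * max_offset + 1))
        self.delta = nn.Linear(d, 1)           # predicts how far to advance

    def forward(self, q, keys, p):             # q: (B, d), keys: (B, T, d), p: (B,)
        B, T, _ = keys.shape
        offsets = torch.arange(T).float() - p.unsqueeze(1)        # (B, T)
        # Linearly interpolate the bias table at fractional offsets.
        idx = (offsets + self.max_offset).clamp(0, 2 * self.max_offset)
        lo, frac = idx.floor().long(), idx - idx.floor()
        hi = (lo + 1).clamp(max=2 * self.max_offset)
        bias = (1 - frac) * self.bias_table[lo] + frac * self.bias_table[hi]
        logits = (keys @ q.unsqueeze(-1)).squeeze(-1) + bias      # (B, T)
        attn = F.softmax(logits, dim=-1)
        ctx = torch.bmm(attn.unsqueeze(1), keys).squeeze(1)       # (B, d)
        p_next = p + F.softplus(self.delta(q)).squeeze(-1)        # monotone advance
        return ctx, p_next

att = LocationRelativeAttention()
ctx, p = att(torch.randn(2, 128), torch.randn(2, 40, 128), torch.zeros(2))
print(ctx.shape, p.shape)  # torch.Size([2, 128]) torch.Size([2])
```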
Pre-trained large language models (LLMs) can be extended, via parameter-efficient fine-tuning (PEFT) and a Mixture-of-Experts architecture, into text-to-speech (TTS) systems and text-speech multimodal LLMs with speech-generation capabilities.
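For the PEFT side, a standard low-rank adapter (LoRA) illustrates how few parameters such fine-tuning touches; this is a generic sketch, not a specific paper's recipe:

```python
# Minimal LoRA-style adapter: the frozen base weight W is augmented with a
# trainable low-rank update B @ A, so only r*(d_in + d_out) parameters train.
import torch
import torch.nn as nn

class LoRALinear(nn.Module):
    def __init__(self, d_in=512, d_out=512, r=8, alpha=16):
        super().__init__()
        self.base = nn.Linear(d_in, d_out)
        self.base.weight.requires_grad_(False)        # frozen pre-trained weight
        self.base.bias.requires_grad_(False)
        self.A = nn.Parameter(torch.randn(r, d_in) * 0.01)
        self.B = nn.Parameter(torch.zeros(d_out, r))  # zero init: no-op at start
        self.scale = alpha / r

    def forward(self, x):
        return self.base(x) + self.scale * (x @ self.A.T @ self.B.T)

layer = LoRALinear()
trainable = sum(p.numel() for p in layer.parameters() if p.requires_grad)
print(trainable)  # only A and B: 8*512 + 512*8 = 8192
```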
VQTTS, a novel text-to-speech synthesis system, leverages self-supervised vector-quantized acoustic features and a redesigned acoustic model to achieve state-of-the-art performance in speech naturalness.
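The vector-quantization step itself reduces to nearest-neighbour lookup in a codebook. A toy sketch (VQTTS actually uses self-supervised vq-wav2vec features rather than a hand-rolled codebook like this):

```python
# Nearest-neighbour vector quantization, the core of VQ acoustic features:
# each continuous frame is replaced by the closest codebook entry and
# represented by its discrete index.
import torch

def quantize(frames, codebook):
    """frames: (T, d) continuous features; codebook: (K, d)."""
    # Squared L2 distance between every frame and every code vector.
    d2 = (frames.unsqueeze(1) - codebook.unsqueeze(0)).pow(2).sum(-1)  # (T, K)
    idx = d2.argmin(dim=1)              # discrete token per frame
    return idx, codebook[idx]           # indices and quantized features

codebook = torch.randn(256, 32)
idx, q = quantize(torch.randn(100, 32), codebook)
print(idx.shape, q.shape)  # torch.Size([100]) torch.Size([100, 32])
```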
ControlSpeech is a novel TTS system that leverages a decoupled codec and a novel Style Mixture Semantic Density (SMSD) module to achieve simultaneous zero-shot speaker cloning and flexible style control, addressing limitations of previous models that could not independently manipulate content, timbre, and style.
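A mixture-density head makes the "style mixture density" idea concrete: the model predicts weights, means, and variances of a Gaussian mixture over a style latent, and sampling from it yields diverse styles. A sketch under that reading, with hypothetical names and shapes (not ControlSpeech's SMSD module):

```python
# Mixture-density head over a style latent: the prompt encoding is mapped to
# mixture weights, means, and log-variances; a style vector is then sampled.
import torch
import torch.nn as nn

class StyleMixtureHead(nn.Module):
    def __init__(self, d_in=256, d_style=64, n_mix=4):
        super().__init__()
        self.n_mix, self.d_style = n_mix, d_style
        self.proj = nn.Linear(d_in, n_mix * (1 + 2 * d_style))

    def forward(self, h):                       # h: (B, d_in)
        out = self.proj(h).view(-1, self.n_mix, 1 + 2 * self.d_style)
        logit = out[..., 0]                                    # (B, n_mix)
        mu = out[..., 1:1 + self.d_style]                      # (B, n_mix, d)
        log_sigma = out[..., 1 + self.d_style:]                # (B, n_mix, d)
        k = torch.distributions.Categorical(logits=logit).sample()  # component
        rows = torch.arange(h.size(0))
        mu_k, sigma_k = mu[rows, k], log_sigma[rows, k].exp()
        return mu_k + sigma_k * torch.randn_like(mu_k)         # sampled style

head = StyleMixtureHead()
print(head(torch.randn(2, 256)).shape)  # torch.Size([2, 64])
```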
Continuous speech tokenization enhances text-to-speech synthesis by preserving more audio information than traditional discrete methods, leading to improved speech continuity, quality, and robustness to sampling rate variations.
Quantizing inherently continuous modalities like audio for text-to-speech synthesis may be suboptimal, and continuous representation learning using per-token latent diffusion models like SALAD offers a competitive alternative with superior intelligibility.
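Both of the last two summaries replace discrete token prediction with continuous latents; SALAD does so with a per-token latent diffusion head. A toy sampler illustrating that idea, with an ad-hoc schedule and hypothetical shapes (not SALAD's implementation):

```python
# Toy per-token latent diffusion head: instead of a softmax over a discrete
# codebook, each step denoises a continuous latent conditioned on the
# backbone's hidden state.
import torch
import torch.nn as nn

class LatentDiffusionHead(nn.Module):
    def __init__(self, d_cond=256, d_lat=32, steps=8):
        super().__init__()
        self.d_lat, self.steps = d_lat, steps
        self.eps_net = nn.Sequential(
            nn.Linear(d_lat + d_cond + 1, 256), nn.SiLU(), nn.Linear(256, d_lat)
        )

    @torch.no_grad()
    def sample(self, cond):                       # cond: (B, d_cond) per token
        z = torch.randn(cond.size(0), self.d_lat)  # start from pure noise
        for s in reversed(range(self.steps)):
            t = torch.full((cond.size(0), 1), s / self.steps)
            eps = self.eps_net(torch.cat([z, cond, t], dim=-1))
            z = z - eps / self.steps              # crude Euler denoising step
        return z                                  # continuous acoustic latent

head = LatentDiffusionHead()
print(head.sample(torch.randn(2, 256)).shape)  # torch.Size([2, 32])
```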