Key Concepts
HybridVC is an efficient any-to-any voice conversion model that supports both text and audio prompts to achieve flexible voice style conversion.
Summary
HybridVC is a voice conversion (VC) framework built upon a pre-trained conditional variational autoencoder (CVAE) that combines the strengths of a latent model with contrastive learning. It supports both text and audio prompts, enabling more flexible voice style conversion than previous methods, which rely on either text prompts or audio prompts alone.
Key highlights:
- HybridVC models a latent distribution conditioned on speaker embeddings obtained from a pre-trained speaker encoder. In parallel, it optimizes style text embeddings through contrastive learning so that they align with the speaker style information.
- This approach enables HybridVC to be efficiently trained under limited computational resources, while maintaining competitive performance on speech intelligibility, naturalness, and audio quality compared to baseline models.
- Experiments demonstrate HybridVC's superior training efficiency and its capability for advanced multimodal voice style conversion, underscoring its potential for widespread applications such as user-defined personalized voices on social media platforms.
- A comprehensive ablation study further validates the effectiveness of the proposed method, including the benefits of the latent model and the negative sampling technique for text embedding optimization.
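The alignment of style text embeddings with speaker embeddings described in the highlights above can be illustrated with a generic InfoNCE-style contrastive loss. This is a minimal sketch, not the loss used in HybridVC: the summary does not specify the exact objective, so the cosine similarity, the temperature value, the function names (`cosine`, `contrastive_loss`), and the toy 2-dimensional embeddings are all assumptions made for illustration. Each text embedding is paired with the speaker embedding at the same index, and the other speaker embeddings in the batch act as negative samples.

```python
import math


def cosine(u, v):
    """Cosine similarity between two equal-length vectors."""
    dot = sum(a * b for a, b in zip(u, v))
    norm_u = math.sqrt(sum(a * a for a in u))
    norm_v = math.sqrt(sum(b * b for b in v))
    return dot / (norm_u * norm_v)


def contrastive_loss(text_embs, speaker_embs, temperature=0.1):
    """Generic InfoNCE-style loss (an assumption, not the paper's exact loss).

    text_embs[i] is the positive match for speaker_embs[i]; every other
    speaker embedding in the batch serves as a negative sample.
    """
    total = 0.0
    for i, text in enumerate(text_embs):
        logits = [cosine(text, spk) / temperature for spk in speaker_embs]
        # Log-sum-exp with the maximum subtracted for numerical stability.
        m = max(logits)
        log_denom = m + math.log(sum(math.exp(l - m) for l in logits))
        # Negative log-probability of picking the matching speaker.
        total += log_denom - logits[i]
    return total / len(text_embs)


# Toy 2-D embeddings (hypothetical values, for illustration only).
speaker_embs = [[1.0, 0.0], [0.0, 1.0]]
aligned_text = [[0.9, 0.1], [0.1, 0.9]]   # each text close to its own speaker
swapped_text = [[0.1, 0.9], [0.9, 0.1]]   # each text close to the wrong speaker

aligned_loss = contrastive_loss(aligned_text, speaker_embs)
swapped_loss = contrastive_loss(swapped_text, speaker_embs)
print(aligned_loss < swapped_loss)
```

In this toy setup the loss is lower when each text embedding lies near its matching speaker embedding than when the pairs are swapped, which is the behavior a contrastive objective of this kind rewards during training.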
Statistics
HybridVC can achieve competitive performance on speech intelligibility, naturalness, and audio quality with only 15 hours of training on limited computational resources.
HybridVC maintains overall performance without noticeable degradation when extending training to the PromptSpeech dataset, despite the backbone CVAE only being pre-trained on the VCTK training set.
HybridVC effectively maintains the prosody of source speech and audio quality, and accurately converts the voice characteristics given audio and text prompts.
HybridVC successfully adapts voices to match the specified style text prompts, with an accuracy of 89.8% for "higher pitch" and 91.1% for "higher volume" prompts.
Quotes
"HybridVC supports text and audio prompts, enabling more flexible voice style conversion."
"HybridVC can be efficiently trained under limited computational resources and achieves a flexible VC system that supports hybrid prompts."
"HybridVC supports small-scale training which can be easily adapted to applications such as user-defined personalised voice."