Core Concepts
VoxGenesis is an unsupervised speech synthesis framework that discovers a latent speaker manifold and enables voice editing without speaker labels or reference utterances. By transforming a Gaussian distribution into speech distributions conditioned on semantic tokens, VoxGenesis disentangles speaker characteristics from content information.
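As a rough sketch of this formulation (module names and sizes below are illustrative assumptions, not taken from the paper's implementation), the generator can be read as a function that maps a Gaussian latent, which carries speaker identity, and a sequence of semantic tokens, which carries content, to speech features:

```python
import torch
import torch.nn as nn

class ToyVoxGenerator(nn.Module):
    """Illustrative generator: a Gaussian latent z supplies speaker identity,
    semantic tokens supply content. All dimensions are made up for the sketch."""
    def __init__(self, latent_dim=128, num_tokens=512, token_dim=256, mel_dim=80):
        super().__init__()
        self.token_embed = nn.Embedding(num_tokens, token_dim)  # shared embedding layer
        self.mapping = nn.Sequential(                           # mapping network: z -> speaker code
            nn.Linear(latent_dim, 256), nn.ReLU(), nn.Linear(256, token_dim))
        self.decoder = nn.GRU(token_dim, 256, batch_first=True)
        self.to_mel = nn.Linear(256, mel_dim)

    def forward(self, z, tokens):
        w = self.mapping(z)                      # (B, token_dim) speaker code
        c = self.token_embed(tokens)             # (B, T, token_dim) content features
        h, _ = self.decoder(c + w.unsqueeze(1))  # inject the speaker code at every frame
        return self.to_mel(h)                    # (B, T, mel_dim) mel-spectrogram

z = torch.randn(4, 128)                  # sample four new "speakers" from N(0, I)
tokens = torch.randint(0, 512, (4, 50))  # semantic tokens carrying the content
mel = ToyVoxGenerator()(z, tokens)       # (4, 50, 80)
```

Sampling a fresh z while holding the tokens fixed changes who is speaking but not what is said, which is the disentanglement described above.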
Abstract
VoxGenesis is an unsupervised approach to speech synthesis that discovers latent speaker characteristics without labels. The model enables voice editing by moving latent codes along identified directions associated with specific speaker attributes such as gender, pitch, tone, and emotion. In extensive experiments, VoxGenesis produces significantly more diverse and realistic speakers than previous approaches.
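The voice editing described here reduces to shifting a sampled latent along a discovered attribute direction. A minimal sketch, assuming a direction vector (say, one associated with pitch) has already been identified; the function name and the direction are hypothetical:

```python
import torch

def edit_speaker(z, direction, strength=1.5):
    """Shift a latent speaker code along a discovered attribute direction.
    `direction` is assumed to come from latent-space analysis (e.g., a
    gender or pitch axis); `strength` controls how far the voice moves."""
    direction = direction / direction.norm()  # unit step keeps `strength` interpretable
    return z + strength * direction

z = torch.randn(128)          # a sampled speaker
pitch_dir = torch.randn(128)  # placeholder for a direction identified offline
z_edited = edit_speaker(z, pitch_dir)  # same content pipeline, shifted voice
```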
The paper argues that current speech synthesis models are limited in their ability to generate genuinely new voices, and that disentangling content from speaker features is key to doing so. By transforming a Gaussian distribution into speech distributions conditioned on semantic tokens, VoxGenesis supports more flexible voice editing and customization. Its ability to uncover human-interpretable directions associated with specific speaker characteristics sets it apart from traditional supervised methods.
VoxGenesis uses deep generative modeling to transform random noise into meaningful speech distributions while retaining control over semantic content. By combining a mapping network, a shared embedding layer, and semantic transformation matrices, it identifies the major directions of variance in the latent space associated with different speaker attributes. The architecture also allows external speaker representations to be encoded efficiently and keeps training stable.
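One common way to expose such major-variance directions, offered here as an assumption since the paper's exact procedure may differ, is to eigen-decompose the outputs of the mapping network, e.g. PCA over a large batch of mapped latents:

```python
import torch

def principal_directions(mapping, num_samples=10_000, latent_dim=128, k=8):
    """Estimate the top-k directions of variance in the mapped speaker space.
    Mirrors common GAN latent analysis; the paper's method may differ."""
    with torch.no_grad():
        z = torch.randn(num_samples, latent_dim)
        w = mapping(z)                       # speaker codes from the mapping network
        _, _, v = torch.pca_lowrank(w, q=k)  # columns of v: candidate edit directions
    return v.T                               # (k, code_dim)

mapping = torch.nn.Linear(128, 256)   # stand-in for a trained mapping network
dirs = principal_directions(mapping)  # each row is a candidate interpretable axis
```

Each returned row can then be fed to an editing step like the edit_speaker sketch above.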
Evaluation results show that VoxGenesis generates diverse, realistic speakers with distinct characteristics, outperforming previous approaches in fidelity to the training speakers, diversity of generated speakers, and overall speech quality. VoxGenesis also achieves promising results on zero-shot voice conversion and multi-speaker TTS.
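For the zero-shot voice conversion use case, one can picture pairing the semantic tokens of a source utterance with a speaker latent encoded from a target utterance. The tokenizer and speaker_encoder below are hypothetical stand-ins rather than APIs from the paper:

```python
import torch

def convert_voice(generator, tokenizer, speaker_encoder, source_wav, target_wav):
    """Illustrative zero-shot conversion: the source supplies content tokens,
    the target utterance is encoded into the generator's speaker latent space."""
    tokens = tokenizer(source_wav)          # semantic tokens of the source speech
    z_target = speaker_encoder(target_wav)  # external speaker representation -> latent
    return generator(z_target, tokens)      # source words spoken in the target voice

# toy stand-ins so the sketch runs end to end (a real system would use
# trained components, e.g. the ToyVoxGenerator sketched earlier)
tokenizer = lambda wav: torch.randint(0, 512, (1, 50))
speaker_encoder = lambda wav: torch.randn(1, 128)
generator = lambda z, tokens: torch.randn(1, tokens.shape[1], 80)
mel = convert_voice(generator, tokenizer, speaker_encoder, "src.wav", "tgt.wav")
```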
Stats
Achieving nuanced and accurate emulation of human voice has been a longstanding goal in artificial intelligence.
Mainstream speech synthesis models rely on supervised speaker modeling and explicit reference utterances.
In this paper, we propose VoxGenesis, an unsupervised speech synthesis framework.
VoxGenesis transforms a Gaussian distribution into speech distributions conditioned on semantic tokens.
Sampling from the Gaussian distribution enables the creation of novel speakers with distinct characteristics.
Extensive experiments show that VoxGenesis produces significantly more diverse and realistic speakers than previous approaches.
Latent space manipulation uncovers human-interpretable directions associated with specific speaker characteristics.
Voice editing is enabled by manipulating latent codes along identified directions.
VoxGenesis can be used in voice conversion and multi-speaker TTS applications.
Quotes
"VoxGenesis introduces an unsupervised approach to speech synthesis."
"The model enables voice editing by manipulating latent codes along identified directions."