Core Concepts
SelfVC is a training strategy that uses self-generated voice transformations to iteratively improve a controllable voice conversion model built on imperfectly disentangled speech representations.
Abstract
The paper proposes SelfVC, a framework for text-free voice conversion that can modify the voice of a given utterance while providing control over the prosody of the synthesized speech.
The key components are:
Feature Extraction: The framework derives a content (linguistic) embedding, a speaker embedding, the pitch contour, and duration information from the input audio signal.
Synthesizer Model: The synthesizer reconstructs the mel-spectrogram from the extracted features, with intermediate modules to predict the pitch and duration.
Iterative Refinement using Self Transformations: To train an effective voice conversion model, the authors propose a strategy that uses the synthesizer itself to generate diverse voice-converted variations of the input audio. At each training iteration, the current state of the synthesizer produces a voice-converted version of the utterance; the content embedding is re-derived from this self-transformed audio, and the model is updated to reconstruct the original audio from it.
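The self-transformation step above can be sketched in a few lines. Everything here is an illustrative stand-in, not the paper's actual API: `extract_content`, `extract_speaker`, and `synthesize` replace the real Conformer-SSL content encoder, speaker encoder, and mel-spectrogram synthesizer with trivial arithmetic, just to show the data flow of one training iteration.

```python
import random

def extract_content(audio):
    # Stand-in for the Conformer-SSL content encoder.
    return list(audio)

def extract_speaker(audio):
    # Stand-in speaker encoder: a single scalar "voice" attribute here.
    return sum(audio) / len(audio)

def synthesize(content, speaker):
    # Stand-in synthesizer: shifts the content toward the target speaker.
    mean = sum(content) / len(content)
    return [x - mean + speaker for x in content]

def self_transform_step(audio, speaker_pool):
    """One training iteration using a self-generated transformation."""
    # 1. Use the current synthesizer to voice-convert the utterance
    #    to a randomly chosen target speaker.
    target = random.choice(speaker_pool)
    converted = synthesize(extract_content(audio), target)
    # 2. Re-derive the content embedding from the self-transformed audio
    #    and reconstruct, conditioning on the ORIGINAL speaker embedding.
    recon = synthesize(extract_content(converted), extract_speaker(audio))
    # 3. The reconstruction loss against the original audio would drive
    #    the model update (the update itself is omitted in this sketch).
    return sum((a - b) ** 2 for a, b in zip(recon, audio)) / len(audio)
```

Because this toy "model" disentangles content from speaker perfectly, the reconstruction is exact and the loss is zero; the point of SelfVC's training strategy is that the real model, whose representations are imperfectly disentangled, is pushed toward this behavior by repeatedly reconstructing originals from its own conversions.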
The authors demonstrate that incorporating self-synthesized transformations during training leads to significant improvements in speaker similarity metrics compared to using only heuristic transformations. SelfVC achieves state-of-the-art results in zero-shot voice conversion for English and outperforms prior approaches in cross-lingual voice conversion tasks.
Stats
The Conformer-SSL model used as the content encoder is pretrained on 56k hours of unlabelled English speech from the LibriLight dataset.
The mel-spectrogram synthesizer and HiFiGAN vocoder are trained on the train-clean-360 subset of the LibriTTS dataset, which contains 360 hours of speech from 904 speakers.
Quotes
"To expand on the heuristic set of transforms, we propose to utilize the synthesizer model itself to generate a voice-converted variation of a given utterance."
"SelfVC achieves SOTA results in zero-shot voice conversion for English and can be easily scaled to multiple languages in a text-free manner, outperforming prior approaches in cross-lingual voice conversion."