Core Concepts
Prosody-aware VITS (PAVITS) is an end-to-end emotional voice conversion model inspired by VITS. It integrates the acoustic converter and the vocoder into a single architecture to improve both content naturalness and emotional naturalness.
Abstract
Prosody-aware VITS (PAVITS) aims to improve emotional voice conversion by focusing on content and emotional naturalness. The model combines a prosody predictor, an emotion descriptor, and a prosody alignment loss to enhance the quality of the converted audio. Experimental results show superior performance compared to existing methods.
The paper highlights the challenges in emotional voice conversion, emphasizing the importance of content naturalness and rich emotion representation. PAVITS addresses these challenges through an innovative approach that combines acoustic prosody modeling with textual prosody prediction.
By leveraging a conditional variational autoencoder framework, PAVITS effectively captures fine-grained emotional prosody features across different speech emotions. The model's architecture includes modules for textual prosody prediction, acoustic prosody modeling, information alignment, and emotional speech synthesis.
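Conditional VAE objectives typically include a KL divergence term that pulls the posterior over latent variables toward a condition-dependent prior. This summary does not give the exact PAVITS loss, so the following is only a minimal sketch of that term under the assumption that both distributions are diagonal Gaussians. The function name and argument layout are hypothetical, not taken from the paper.

```python
import math


def kl_diag_gaussians(mu_q, log_sigma_q, mu_p, log_sigma_p):
    """KL(q || p) between two diagonal Gaussians, summed over dimensions.

    Assumption for illustration: q is the posterior (from acoustic features)
    and p is the prior (from the conditioning input). Each argument is a
    list of floats of equal length; log_sigma_* are log standard deviations.
    """
    total = 0.0
    for mq, lq, mp, lp in zip(mu_q, log_sigma_q, mu_p, log_sigma_p):
        var_q = math.exp(2.0 * lq)  # posterior variance
        var_p = math.exp(2.0 * lp)  # prior variance
        # Closed-form KL for one dimension of a univariate Gaussian pair.
        total += lp - lq + (var_q + (mq - mp) ** 2) / (2.0 * var_p) - 0.5
    return total


# Identical distributions have zero divergence.
same = kl_diag_gaussians([0.0, 1.0], [0.0, 0.0], [0.0, 1.0], [0.0, 0.0])

# Unit-variance Gaussians whose means differ by 1 in a single dimension.
shifted = kl_diag_gaussians([1.0], [0.0], [0.0], [0.0])
```

In a full model this term would be weighted and added to a reconstruction loss; how PAVITS combines it with its prosody alignment loss is not stated here.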
Experimental results demonstrate the effectiveness of PAVITS in achieving high-quality emotional voice conversion. The model outperforms traditional methods on both the objective metric, Mel-cepstral distortion (MCD), and subjective Mean Opinion Score (MOS) tests.
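For readers unfamiliar with MCD, the sketch below uses a commonly cited definition: a scaled Euclidean distance between mel-cepstral coefficient vectors, averaged over frames, with lower values indicating a closer match. The summary does not say how the paper computes it, so the details here are assumptions: the frames are taken to be already time-aligned (for example by dynamic time warping), and the 0th (energy) coefficient is excluded.

```python
import math


def mel_cepstral_distortion(ref_frames, conv_frames):
    """Average MCD in dB between two aligned sequences of mel-cepstra.

    Assumptions for illustration: both inputs are lists of frames with the
    same length, each frame is a list of coefficients, and index 0 holds the
    energy term, which is skipped.
    """
    if not ref_frames or len(ref_frames) != len(conv_frames):
        raise ValueError("inputs must be non-empty and time-aligned")
    scale = 10.0 / math.log(10.0) * math.sqrt(2.0)
    total = 0.0
    for ref, conv in zip(ref_frames, conv_frames):
        # Squared distance over coefficients 1..D for this frame.
        sq_dist = sum((r - c) ** 2 for r, c in zip(ref[1:], conv[1:]))
        total += scale * math.sqrt(sq_dist)
    return total / len(ref_frames)


# Identical sequences give zero distortion.
zero_mcd = mel_cepstral_distortion([[1.0, 0.2, 0.3]], [[1.0, 0.2, 0.3]])

# The 0th coefficient differs but is ignored; one other coefficient differs by 1.
unit_mcd = mel_cepstral_distortion([[0.0, 1.0, 0.0]], [[5.0, 0.0, 0.0]])
```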
Overall, Prosody-aware VITS (PAVITS) presents a novel approach to enhancing emotional voice conversion by integrating advanced techniques for prosody modeling and alignment. The model shows promise in improving both content naturalness and emotional expressiveness in converted audio samples.
Stats
Experimental results show that PAVITS outperforms other models on Mel-cepstral distortion (lower is better), with MCD values ranging from 3.42 to 4.66.
MOS scores for speech quality and naturalness range from 4.62 to 4.72 for PAVITS models.
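A MOS is the mean of listener ratings, often reported with a 95% confidence interval. The sketch below shows one way such a figure could be computed. The ratings are made-up illustration values, not data from the paper, and the normal-approximation interval (1.96 standard errors) is an assumption, since the summary does not describe the paper's listening-test protocol.

```python
import math
import statistics


def mos_with_ci(ratings, z=1.96):
    """Return (mean, half_width) for a list of listener ratings.

    half_width is z standard errors, i.e. a normal-approximation
    confidence interval (z=1.96 corresponds to roughly 95%).
    """
    if len(ratings) < 2:
        raise ValueError("need at least two ratings")
    mean = statistics.mean(ratings)
    std_err = statistics.stdev(ratings) / math.sqrt(len(ratings))
    return mean, z * std_err


# Hypothetical ratings from four listeners, for illustration only.
mos, half_width = mos_with_ci([4, 5, 4, 5])
print(f"MOS = {mos:.2f} +/- {half_width:.2f}")
```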
The ablation study shows that performance degrades when components such as the prosody predictor or the prosody integrator are removed.
Quotes
"PAVITS achieves competitive performance on both objective and subjective evaluation."
"Our proposed PAVITS-VL aligns more closely with human perception in converted audio."
"The spectrogram converted by PAVITS exhibits finer details in prosody variations."