A novel framework for expressive voice conversion based on soft speech units, adversarial style augmentation, and knowledge distillation for prosody modeling.
A fully end-to-end expressive voice conversion framework based on a conditional diffusion model that models both speaker-dependent emotional cues and speaker-independent emotional style, enabling any-to-any conversion.