Key Idea
MM-TTS is a unified framework that leverages emotional cues from multiple modalities (text, audio, and visual) to generate highly expressive, emotionally resonant speech.
Abstract
The MM-TTS framework consists of two key components:
- Emotion Prompt Alignment Module (EP-Align):
  - Employs contrastive learning to align emotional features across text, audio, and visual modalities, ensuring a coherent fusion of multimodal information.
  - Constructs vision-prompt, audio-prompt, and text-prompt embedding spaces to facilitate alignment of the multimodal representations.
  - Uses prompt-anchoring to bridge implicit and explicit emotion representations, enabling more coherent integration of the emotion embedding into the TTS process.
- Emotion Embedding-Induced TTS (EMI-TTS):
  - Integrates the aligned emotional embeddings from EP-Align with state-of-the-art TTS models, including Tacotron2, VITS, and FastSpeech2, to synthesize speech that accurately reflects the intended emotions.
  - Employs prompt-anchoring multimodal fusion to mitigate bias among the multimodal emotion embedding spaces and coherently integrate the emotion embedding into the TTS process.
  - Offers flexibility in generating emotional speech across different scenarios and requirements by incorporating various TTS models.
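The contrastive alignment step in EP-Align can be illustrated with a symmetric InfoNCE-style objective over paired modality embeddings. This is a minimal sketch of the general technique, not the paper's implementation: the function names, batch shapes, and temperature value are assumptions for illustration.

```python
import numpy as np

def l2_normalize(x, axis=-1):
    # Normalize embeddings to unit length so dot products are cosine similarities.
    return x / np.linalg.norm(x, axis=axis, keepdims=True)

def contrastive_alignment_loss(emb_a, emb_b, temperature=0.07):
    """Symmetric InfoNCE loss between two batches of paired emotion embeddings,
    e.g. text-prompt vs. audio-prompt features for the same utterances.
    Matching pairs share a row index, so positives lie on the diagonal."""
    a = l2_normalize(emb_a)
    b = l2_normalize(emb_b)
    logits = a @ b.T / temperature  # (batch, batch) similarity matrix
    n = len(a)

    def xent_diag(lg):
        # Cross-entropy with the diagonal as the target class for each row.
        lg = lg - lg.max(axis=1, keepdims=True)
        log_probs = lg - np.log(np.exp(lg).sum(axis=1, keepdims=True))
        return -log_probs[np.arange(n), np.arange(n)].mean()

    # Average both directions (a -> b and b -> a).
    return 0.5 * (xent_diag(logits) + xent_diag(logits.T))

# Toy usage: audio embeddings are a slightly noised copy of the text
# embeddings, so the pairs are already well aligned and the loss is small.
rng = np.random.default_rng(0)
text_emb = rng.normal(size=(8, 16))
audio_emb = text_emb + 0.05 * rng.normal(size=(8, 16))
loss = contrastive_alignment_loss(text_emb, audio_emb)
```

Minimizing this loss pulls embeddings of the same emotional content together across modalities while pushing mismatched pairs apart, which is the basic mechanism behind contrastive multimodal alignment.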
Extensive evaluations across diverse datasets demonstrate the superior performance of MM-TTS compared to traditional E-TTS models. Objective metrics, including Word Error Rate (WER) and Character Error Rate (CER), show significant improvements on the ESD dataset, with MM-TTS achieving scores of 7.35% and 3.07%, respectively. Subjective assessments further validate that MM-TTS generates speech with emotional fidelity and naturalness comparable to human speech.
Statistics
The Word Error Rate (WER) of MM-TTS (FastSpeech) is 7.35%, and the Character Error Rate (CER) is 3.07% on the ESD dataset.
The Emotion Similarity Mean Opinion Score (MOS) of MM-TTS (FastSpeech) is 4.37, closely matching the ground truth MOS of 4.57.
The Speech Naturalness MOS of MM-TTS (FastSpeech) is 4.29, and the Speaker Similarity MOS is 4.13.
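The WER and CER figures reported above are standard edit-distance metrics. As a reference for how they are computed (a generic sketch, not the paper's evaluation code), the error rate is the Levenshtein distance between reference and hypothesis, normalized by reference length, over words for WER and characters for CER:

```python
def edit_distance(ref, hyp):
    # Levenshtein distance via dynamic programming with a single rolling row.
    m, n = len(ref), len(hyp)
    dp = list(range(n + 1))
    for i in range(1, m + 1):
        prev, dp[0] = dp[0], i
        for j in range(1, n + 1):
            cur = dp[j]
            dp[j] = min(dp[j] + 1,                          # deletion
                        dp[j - 1] + 1,                      # insertion
                        prev + (ref[i - 1] != hyp[j - 1]))  # substitution
            prev = cur
    return dp[n]

def wer(reference, hypothesis):
    # Word Error Rate: edit distance over word sequences.
    ref_words = reference.split()
    return edit_distance(ref_words, hypothesis.split()) / len(ref_words)

def cer(reference, hypothesis):
    # Character Error Rate: edit distance over character sequences.
    return edit_distance(reference, hypothesis) / len(reference)

# One substitution ("is" -> "was") out of four words -> WER 0.25.
score = wer("he is happy today", "he was happy today")
```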
Quotes
"MM-TTS, a groundbreaking framework designed to elevate the expressiveness of synthesized speech by incorporating multimodal cues encompassing text, audio, and visual information."
"By effectively aligning emotional features and filtering out complex noise in multimodal content, EP-Align addresses the challenge of distribution discrepancies, serving as a critical component in enabling high-quality emotional speech generation."
"The incorporation of these aligned emotional embeddings enhances the naturalness and credibility of the generated audio, resulting in a more engaging and immersive user experience."