The paper proposes a method to improve neural machine translation (NMT) by incorporating automatically recognized emotional prosody from speech. The approach involves a two-stage procedure:
1. A state-of-the-art speech emotion recognition (SER) model is used to predict dimensional emotion values (arousal, dominance, valence) from input audio recordings.
2. The predicted emotion values are converted into discrete tokens and added at the beginning of the corresponding input text sentences. The emotion-aware input is then used to train the NMT model.
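The token-prefixing step can be illustrated with a minimal sketch. The summary does not say how the values are discretized or what the tokens look like, so everything specific here is an assumption for illustration only: scores are assumed to lie in [0, 1], they are split into equal-width bins, and the function names `emotion_token` and `add_emotion_prefix` as well as the `<arousal_2>` token format are hypothetical.

```python
from typing import Dict


def emotion_token(dimension: str, value: float, n_bins: int = 3) -> str:
    """Map one emotion score to a discrete token such as <arousal_2>.

    Assumption: the score lies in [0, 1] and is split into equal-width bins.
    The actual binning and token format used in the paper may differ.
    """
    clipped = min(max(value, 0.0), 1.0)                 # guard against out-of-range scores
    bin_index = min(int(clipped * n_bins), n_bins - 1)  # a score of 1.0 falls in the last bin
    return f"<{dimension}_{bin_index}>"


def add_emotion_prefix(sentence: str, scores: Dict[str, float], n_bins: int = 3) -> str:
    """Prepend one token per emotion dimension to the source sentence."""
    tokens = [emotion_token(dim, val, n_bins) for dim, val in scores.items()]
    return " ".join(tokens + [sentence])


if __name__ == "__main__":
    # Hypothetical SER output for one utterance.
    print(add_emotion_prefix("It was a dark night.", {"arousal": 0.9}))
```

Passing only the `arousal` score, as in the usage line, corresponds to the single-dimension setting; passing all three dimensions would prepend three tokens in dictionary order.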
Experiments are conducted on the Libri-trans dataset, which contains English-French parallel text and audio data. The results show that integrating emotion information, especially arousal, into the NMT model leads to better translation quality as measured by BLEU scores. Using emotion tokens extracted from real speech recordings outperforms using tokens from synthesized speech.
The authors note that all BLEU scores are relatively low, indicating poor overall translation quality. They attribute this to the nature of the Libri-trans dataset, which contains book-like language rather than conversational speech. Further experiments on other datasets are suggested to validate the effectiveness of the proposed emotion-aware NMT approach.
Key Insights Distilled From
by Charles Braz... at arxiv.org 04-30-2024
https://arxiv.org/pdf/2404.17968.pdf
Deeper Inquiries