Core Concepts
The authors propose the Emotion Neural Transducer (ENT) and its factorized variant (FENT) to enable fine-grained speech emotion recognition by jointly modeling acoustic and linguistic information through a neural transducer architecture.
Abstract
The authors present two key components of the Emotion Neural Transducer (ENT) model:
Emotion Joint Network: This extends the typical neural transducer architecture by adding an emotion joint network to integrate representations from the acoustic encoder and linguistic predictor, enabling the modeling of emotion categorical distribution through the alignment lattice.
Lattice Max Pooling: To distinguish emotional and non-emotional frames in a weakly supervised setting, the authors propose a lattice max pooling loss that selects the nodes with the highest predicted probability of the target emotion and the minimum non-emotional probability.
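The two components above can be sketched in PyTorch. This is a minimal illustration, not the authors' implementation: the layer sizes, the `tanh` fusion, and the exact form of the pooling loss (supervising the most confident target-emotion node plus the node least confident in the non-emotional class) are assumptions based on the description.

```python
import torch
import torch.nn as nn

class EmotionJointNetwork(nn.Module):
    """Sketch of an emotion joint network: fuses acoustic encoder states (T steps)
    with linguistic predictor states (U steps) into a per-lattice-node
    emotion distribution of shape (B, T, U, num_emotions)."""
    def __init__(self, enc_dim, pred_dim, hidden_dim, num_emotions):
        super().__init__()
        self.proj = nn.Linear(enc_dim + pred_dim, hidden_dim)
        self.out = nn.Linear(hidden_dim, num_emotions)

    def forward(self, enc, pred):
        # enc: (B, T, enc_dim); pred: (B, U, pred_dim)
        T, U = enc.size(1), pred.size(1)
        enc = enc.unsqueeze(2).expand(-1, -1, U, -1)    # (B, T, U, enc_dim)
        pred = pred.unsqueeze(1).expand(-1, T, -1, -1)  # (B, T, U, pred_dim)
        fused = torch.tanh(self.proj(torch.cat([enc, pred], dim=-1)))
        return self.out(fused)                          # emotion logits per lattice node

def lattice_max_pooling_loss(logits, target_emotion, neutral_idx=0):
    """Hypothetical lattice max pooling loss for weak (utterance-level) labels:
    pull up the node most confident in the target emotion, and also supervise
    the node with the minimum non-emotional (neutral) probability."""
    log_probs = logits.log_softmax(dim=-1)              # (B, T, U, E)
    B, E = logits.size(0), logits.size(-1)
    flat = log_probs.reshape(B, -1, E)                  # (B, T*U, E)
    # log p(target emotion) at every lattice node
    idx = target_emotion.view(B, 1, 1).expand(-1, flat.size(1), -1)
    tgt_lp = flat.gather(2, idx).squeeze(-1)            # (B, T*U)
    # positive term: node most confident in the target emotion
    pos = tgt_lp.max(dim=1).values
    # node with minimum non-emotional probability, pushed toward the target
    min_neu = flat[:, :, neutral_idx].argmin(dim=1)
    neg = tgt_lp[torch.arange(B), min_neu]
    return -(pos + neg).mean()
```

Only two lattice nodes per utterance receive direct supervision, which is how the loss separates emotional from non-emotional frames without frame-level labels.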
The authors further extend ENT to the Factorized Emotion Neural Transducer (FENT), which disentangles the blank symbol from vocabulary prediction and shares the predictor for both blank and emotion prediction. This allows the blank symbol to serve as an underlying indicator of emotion during inference.
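The factorization can also be sketched. Again a hedged illustration, not the paper's code: the key assumption shown is that the blank logit and the emotion logits branch off a shared representation, separate from the vocabulary head, so that blank emission and emotion prediction are coupled.

```python
import torch
import torch.nn as nn

class FactorizedPredictor(nn.Module):
    """FENT-style sketch: the blank symbol is disentangled from vocabulary
    prediction, and its head shares a hidden representation with the
    emotion head, letting blank act as an implicit emotion indicator."""
    def __init__(self, pred_dim, vocab_size, num_emotions):
        super().__init__()
        self.vocab_head = nn.Linear(pred_dim, vocab_size)    # non-blank tokens only
        self.shared = nn.Linear(pred_dim, pred_dim)          # shared for blank + emotion
        self.blank_head = nn.Linear(pred_dim, 1)
        self.emotion_head = nn.Linear(pred_dim, num_emotions)

    def forward(self, h):
        # h: predictor hidden states, (B, U, pred_dim)
        vocab_logits = self.vocab_head(h)
        s = torch.tanh(self.shared(h))
        blank_logit = self.blank_head(s)                     # factored-out blank
        emotion_logits = self.emotion_head(s)                # tied to blank via s
        token_logits = torch.cat([blank_logit, vocab_logits], dim=-1)
        return token_logits, emotion_logits
```

Because the blank and emotion heads read the same shared state, the pattern of blank emissions at inference time carries information about the predicted emotion, matching the indicator behavior described above.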
Experiments on the IEMOCAP dataset show that the ENT models outperform state-of-the-art utterance-level speech emotion recognition methods, while also achieving low word error rates. The authors also validate the fine-grained emotion modeling capability of their approaches on the ZED speech emotion diarization dataset.
Stats
The IEMOCAP dataset contains utterances annotated with single emotion category labels.
The ZED dataset includes utterances annotated with the temporal boundaries of the emotional segments within each utterance.