Key Concepts
Prompting large language models (LLMs) with carefully designed strategies that incorporate conversation context and the outputs of multiple ASR systems significantly improves post-ASR speech emotion recognition accuracy, without any task-specific training.
Abstract
Bibliographic Information:
Stepachev, P., Chen, P., & Haddow, B. (2024). Context and System Fusion in Post-ASR Emotion Recognition with Large Language Models. arXiv preprint arXiv:2410.03312v1.
Research Objective:
This research investigates the optimal use of large language models (LLMs) for speech emotion recognition (SER) in a post-ASR setting, focusing on the effective utilization of conversation context and outputs from multiple ASR systems.
Methodology:
The researchers explored various prompting strategies for LLMs using the GenSEC Task 3 dataset, which includes ASR outputs of conversations from the IEMOCAP dataset. They experimented with different methods for selecting and ranking ASR outputs, incorporating variable conversation context lengths, and fusing outputs from multiple ASR systems. The performance of these strategies was evaluated based on their accuracy in predicting speaker emotions.
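The prompting setup described above can be sketched as follows. This is a minimal illustration, not the authors' actual prompt: the template wording, role labels, emotion label set, and function name are all assumptions made for the example.

```python
def build_ser_prompt(context_turns, candidate_transcripts,
                     emotions=("happy", "sad", "angry", "neutral")):
    """Assemble a post-ASR SER prompt from prior conversation turns and
    ASR candidates for the current utterance.

    `context_turns` is a list of (speaker, text) pairs covering the chosen
    context window; `candidate_transcripts` holds outputs from multiple ASR
    systems. The template wording here is illustrative only.
    """
    lines = ["Conversation so far:"]
    for speaker, text in context_turns:
        lines.append(f"{speaker}: {text}")
    lines.append("ASR hypotheses for the current utterance:")
    for i, hyp in enumerate(candidate_transcripts, 1):
        lines.append(f"{i}. {hyp}")
    lines.append(
        "Based on the conversation and the hypotheses, which emotion "
        f"({', '.join(emotions)}) does the current speaker express? "
        "Answer with a single word."
    )
    return "\n".join(lines)
```

Varying the length of `context_turns` corresponds to the context-window experiments, and passing one versus several entries in `candidate_transcripts` corresponds to single-system versus fused-system prompting.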
Key Findings:
- LLMs, specifically GPT-4o, demonstrate significant improvement over the baseline in speech emotion recognition accuracy when provided with carefully crafted prompts.
- Incorporating conversation context generally improves accuracy, with diminishing returns as the context window size increases.
- The choice of metric used to select the ASR transcript for LLM input significantly impacts performance, with character-level metrics like chrF and chrF++ outperforming word-level metrics like WER.
- Fusing outputs from multiple ASR systems further enhances accuracy, suggesting that different systems capture different speech nuances.
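One plausible reading of the metric-based selection finding is scoring each ASR candidate against the remaining candidates with a character-level metric and keeping the one with the highest mutual agreement. The sketch below uses a simplified single-order character n-gram F-score in the spirit of chrF, not the full chrF/chrF++ as implemented in sacrebleu; the function names and the mutual-agreement selection rule are assumptions for illustration.

```python
from collections import Counter

def char_ngrams(text, n=3):
    """Character n-gram counts with whitespace removed."""
    text = text.replace(" ", "")
    return Counter(text[i:i + n] for i in range(len(text) - n + 1))

def chrf_like(hyp, ref, n=3, beta=2.0):
    """Simplified chrF-style score: single-order character n-gram F-score."""
    h, r = char_ngrams(hyp, n), char_ngrams(ref, n)
    if not h or not r:
        return 0.0
    overlap = sum((h & r).values())
    prec = overlap / sum(h.values())
    rec = overlap / sum(r.values())
    if prec + rec == 0:
        return 0.0
    return (1 + beta ** 2) * prec * rec / (beta ** 2 * prec + rec)

def select_by_agreement(candidates):
    """Index of the candidate with the highest mean chrF-like agreement
    with the other candidates (an assumed selection rule)."""
    def mean_agreement(i):
        others = [c for j, c in enumerate(candidates) if j != i]
        return sum(chrf_like(candidates[i], o) for o in others) / len(others)
    return max(range(len(candidates)), key=mean_agreement)
```

Because the score operates on character n-grams rather than whole words, small ASR spelling variants still overlap heavily, which is consistent with the finding that character-level metrics outrank WER for this selection step.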
Main Conclusions:
This study highlights the potential of LLMs for training-free speech emotion recognition by effectively leveraging context and multiple ASR system outputs. The proposed prompting strategies, particularly those incorporating context and system fusion, significantly improve accuracy without requiring task-specific LLM training. This approach also mitigates the risk of overfitting to speaker-specific or ASR system-specific biases.
Significance:
This research contributes to the growing field of LLM applications in speech processing, demonstrating their effectiveness in a challenging task like SER. The findings have implications for developing robust and generalizable SER systems that rely on readily available LLMs without extensive training.
Limitations and Future Research:
The study primarily focuses on a single dataset and a limited set of LLM prompting strategies. Future research could explore the generalizability of these findings to other datasets and languages. Additionally, investigating more sophisticated context modeling techniques and alternative fusion methods could further enhance SER performance.
Statistics
GPT-4o achieved 75.1% accuracy on the GenSEC Task 3 test set, surpassing the baseline by 20%.
Increasing the context window size generally led to higher accuracy, with the most significant improvements at smaller window sizes (0 to 4).
Character-level metrics (chrF, chrF++) consistently showed higher accuracy than word-level metrics (WER) for ranking ASR outputs.
The "least punc" heuristic, which selects the ASR output with the least punctuation, achieved the highest overall accuracy among naive selection methods.
Fusing outputs from multiple ASR systems, particularly with a larger context window, yielded the highest accuracy.
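The "least punc" heuristic mentioned above is simple to implement; this is a minimal sketch that counts standard ASCII punctuation characters, with ties broken by candidate order (both details are assumptions not specified in the summary).

```python
import string

def least_punc(candidates):
    """Return the ASR candidate containing the fewest punctuation
    characters; ties go to the earliest candidate (an assumption)."""
    def punc_count(text):
        return sum(ch in string.punctuation for ch in text)
    return min(candidates, key=punc_count)
```

The intuition is that heavy punctuation in an ASR hypothesis can signal a noisier or over-segmented decode, so the least-punctuated output serves as a cheap, reference-free quality proxy.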
Quotes
"While (re-)training an LLM for a certain task is prohibitive in many scenarios, we take this opportunity further to understand the optimal use of LLMs in this task by exploring LLM prompting in post-ASR SER."
"To aid reproducibility, we make our code public."
"Given our training-free paradigm, we expect it to be more generalizable to other settings."
"Our final submission records an SER accuracy of 75.1% surpassing the baseline by 20%."