This study examines non-verbal signals in speech, focusing on prosody. It presents an analytical schema and a classification process for interpreting multi-layered prosodic events. The aim is to improve speech technologies by formalizing prosody and to inform communication theories.
Non-verbal signals in speech carry crucial information, ranging from conversation actions to emotions. Because these signals occur simultaneously, the principles that govern prosodic structuring remain unclear. Recent developments in pattern recognition offer new opportunities to understand complex prosodic structures.
The study proposes a schema that interprets surface representations of multi-layered prosodic events. By fine-tuning a pre-trained model, the approach disentangles different orders of prosodic phenomena simultaneously. This method performs comparably to, or better than, human annotation on various types of data.
In addition to formalizing prosody, understanding its patterns can contribute to communication theories and improve language technologies. Disentangling prosodic patterns can help identify constraints affecting speech organization and minimize disparities in acoustic descriptions.
The research also demonstrates the ability to add prosodic labels to aligned transcriptions using transfer learning methods. By re-training models like WHISPER, the study shows promise in decoding complex prosodic structures efficiently.
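As a rough illustration of what adding prosodic labels to an aligned transcription could look like as data, the sketch below attaches labels from several simultaneous layers to time-aligned words. It is not the study's method or its label set: the class names, the `attach_labels` helper, the layer names (`unit_boundary`, `prominence`, `conversation_action`), the labels, and the timings are all hypothetical, chosen only to show several prosodic layers co-occurring on the same word.

```python
from dataclasses import dataclass, field


@dataclass
class AlignedWord:
    """One word of a time-aligned transcription (times in seconds)."""
    text: str
    start: float
    end: float
    # Layer name -> label. Layers are independent, so several can apply at once.
    prosody: dict = field(default_factory=dict)


@dataclass
class ProsodicEvent:
    """A labelled time span on one prosodic layer (hypothetical schema)."""
    layer: str
    label: str
    start: float
    end: float


def attach_labels(words, events):
    """Give each word the label of every event span that contains its midpoint."""
    for word in words:
        midpoint = (word.start + word.end) / 2
        for event in events:
            if event.start <= midpoint < event.end:
                word.prosody[event.layer] = event.label
    return words


# Hypothetical aligned words and hypothetical events on three layers.
words = [
    AlignedWord("so", 0.00, 0.20),
    AlignedWord("you", 0.20, 0.35),
    AlignedWord("came", 0.35, 0.70),
]
events = [
    ProsodicEvent("unit_boundary", "final", 0.35, 0.70),
    ProsodicEvent("prominence", "emphasis", 0.35, 0.70),
    ProsodicEvent("conversation_action", "question", 0.00, 0.70),
]
labelled = attach_labels(words, events)

for word in labelled:
    print(word.text, word.prosody)
```

Keeping one dictionary entry per layer, rather than a single tag per word, reflects the point made above that prosodic signals of different orders occur at the same time and need to be kept apart.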
Key insights distilled from: Tirza Biron,... at arxiv.org, 03-07-2024
https://arxiv.org/pdf/2403.03522.pdf