A Large Dataset of Spontaneous Speech with the Paulistano Accent for Automatic Speech Recognition Evaluation


Core Concepts
A new 239.30-hour spontaneous speech corpus with the paulistano accent in Brazilian Portuguese, the NURC-SP Audio Corpus, is introduced and used to evaluate state-of-the-art automatic speech recognition models.
Summary

The NURC-SP Audio Corpus is a new freely available dataset of spontaneous speech in Brazilian Portuguese, focusing on the paulistano (São Paulo city) accent. It contains 239.30 hours of transcribed audio recordings from 401 different speakers (204 females, 197 males).

The corpus was created by digitizing and transcribing audio recordings from the NURC-SP project, which documented the urban linguistic norm of educated speakers in São Paulo in the 1970s. The transcriptions were initially generated automatically using the WhisperX model and then manually revised by 14 native Brazilian Portuguese speakers.

Four automatic speech recognition (ASR) models were evaluated on the NURC-SP Audio Corpus:

  1. A fine-tuned version of the Wav2Vec2-XLSR-53 model.
  2. A second fine-tuned version of Wav2Vec2-XLSR-53, initialized from a model pre-trained on the CORAA-ASR v1.1 dataset.
  3. A Distil-Whisper model trained on the NURC-SP Audio Corpus using labels generated by the Whisper Large-v3 model.
  4. A fine-tuned version of the Distil-Whisper model using the NURC-SP Audio Corpus.

The fine-tuned Distil-Whisper model achieved the best performance, with a word error rate (WER) of 24.22%, followed by the fine-tuned Wav2Vec2-XLSR-53 model with a WER of 33.73%. These results indicate that the NURC-SP Audio Corpus is a challenging dataset for ASR, and that Distil-Whisper shows promise for low- and medium-resource languages such as Brazilian Portuguese.
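For readers unfamiliar with the metric, WER is the word-level edit distance (substitutions, deletions, and insertions) between the reference and the hypothesis, divided by the number of reference words. The following is a minimal sketch of that standard definition, not the evaluation code used for the corpus. The function name `word_error_rate` and the whitespace tokenization are assumptions made for illustration; the summary does not say how text was normalized before scoring.

```python
def word_error_rate(reference: str, hypothesis: str) -> float:
    """Word-level Levenshtein distance divided by the reference length.

    Assumption: both strings are already normalized (case, punctuation)
    and can be tokenized by splitting on whitespace.
    """
    ref = reference.split()
    hyp = hypothesis.split()
    if not ref:
        raise ValueError("reference must contain at least one word")

    # prev[j] holds the edit distance between the reference prefix
    # processed so far and the first j words of the hypothesis.
    prev = list(range(len(hyp) + 1))
    for i, ref_word in enumerate(ref, start=1):
        curr = [i] + [0] * len(hyp)
        for j, hyp_word in enumerate(hyp, start=1):
            substitution_cost = 0 if ref_word == hyp_word else 1
            curr[j] = min(
                prev[j] + 1,                      # deletion
                curr[j - 1] + 1,                  # insertion
                prev[j - 1] + substitution_cost,  # match or substitution
            )
        prev = curr
    return prev[-1] / len(ref)


# One deleted word out of five reference words gives a WER of 0.2.
example = word_error_rate("eu moro em são paulo", "eu moro em paulo")
```

Under this definition, a WER of 24.22% means roughly one word-level error for every four reference words.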

The NURC-SP Audio Corpus and the trained ASR models are publicly available to enable further research and development in this area.

Statistics
The NURC-SP Audio Corpus contains 239.30 hours of transcribed audio recordings. The corpus has 177,224 segmented audio files with an average duration of 4.86 seconds. The corpus contains a total of 2,099,306 transcribed tokens.
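These figures can be checked against each other with simple arithmetic. The sketch below uses only the numbers quoted above; the variable names are illustrative, and the derived rates (tokens per file, tokens per second) are computed here rather than reported in the source.

```python
# Figures quoted in the statistics above.
num_files = 177_224
avg_duration_seconds = 4.86
reported_hours = 239.30
total_tokens = 2_099_306

# Total duration implied by file count and (rounded) average duration.
estimated_hours = num_files * avg_duration_seconds / 3600

# Derived rates, computed here for illustration only.
tokens_per_file = total_tokens / num_files
tokens_per_second = total_tokens / (reported_hours * 3600)

print(f"estimated hours:   {estimated_hours:.2f}")
print(f"tokens per file:   {tokens_per_file:.2f}")
print(f"tokens per second: {tokens_per_second:.2f}")
```

The estimate comes to about 239.25 hours, which agrees with the reported 239.30 hours once rounding of the average duration is taken into account.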
Quotes

"To the best of our knowledge, Distil-Whisper has never been evaluated for BP datasets."

"The Distil-Whisper fine-tuned model achieved the best performance with a word error rate (WER) of 24.22%, followed by the fine-tuned Wav2Vec2-XLSR-53 model with a WER of 33.73%."

Deeper Inquiries

How can the NURC-SP Audio Corpus be used to improve the performance of ASR models for other regional accents or dialects of Brazilian Portuguese?

The NURC-SP Audio Corpus, with 239.30 hours of spontaneous speech in the paulistano accent, is a valuable resource for improving Automatic Speech Recognition (ASR) models for other regional accents or dialects of Brazilian Portuguese. Researchers can apply transfer learning: models pre-trained on the NURC-SP dataset can be fine-tuned on smaller datasets representing other regional accents. This lets the models retain the phonetic and prosodic characteristics learned from the paulistano accent while adapting to the features of other dialects, such as those from Rio de Janeiro or the Northeast region.

The corpus's speaker diversity, including variation in age and gender, can also help build more robust ASR systems that generalize better across accents. Its spontaneous speech phenomena, such as disfluencies and filled pauses, prepare ASR models to handle real-world conversational speech, which is crucial for accurately transcribing speech from various regional dialects. Overall, the NURC-SP Audio Corpus can contribute to ASR systems that are both more accurate and more culturally and linguistically inclusive.

What other speech-related tasks, beyond ASR, could benefit from the availability of the NURC-SP Audio Corpus?

The NURC-SP Audio Corpus can support several speech-related tasks beyond Automatic Speech Recognition (ASR). One application is speech synthesis, where the corpus can be used to train text-to-speech (TTS) systems that generate natural-sounding speech in the paulistano accent. Trained on spontaneous speech data, TTS systems can learn to produce more authentic and contextually appropriate intonation, rhythm, and prosody.

The corpus can also support research in speaker recognition and verification, since it contains a diverse range of speakers. This diversity allows the development of models that identify and verify speakers from their vocal characteristics.

The corpus can further serve linguistic research, particularly studies of conversational dynamics, discourse analysis, and sociolinguistics. Researchers can investigate how different speakers use language in spontaneous contexts, examining phenomena such as code-switching, speech patterns, and the influence of social factors on language use.

Finally, the corpus may be valuable for training and evaluating models for emotion recognition and sentiment analysis, as spontaneous speech recordings capture a range of emotional expressions and conversational nuances.

What insights could be gained by analyzing the linguistic phenomena, such as disfluencies and prosodic features, present in the spontaneous speech recordings of the NURC-SP Audio Corpus?

Analyzing the linguistic phenomena present in the NURC-SP Audio Corpus, particularly disfluencies and prosodic features, can yield significant insights into the nature of spontaneous speech in Brazilian Portuguese. Disfluencies, such as filled pauses (e.g., "uh," "um") and repetitions, can reveal cognitive processes involved in speech production, including how speakers manage their thoughts and navigate conversational dynamics. Understanding these patterns can inform the development of more sophisticated ASR systems that can better handle real-world speech, which often deviates from the idealized, fluent speech found in read speech datasets.

Prosodic features, including intonation, stress, and rhythm, play a crucial role in conveying meaning and emotion in spoken language. By examining these features in the NURC-SP recordings, researchers can gain insights into how speakers use prosody to signal questions, emphasis, or emotional states, which can enhance the design of speech synthesis systems and improve the naturalness of generated speech.

Furthermore, the analysis of these linguistic phenomena can contribute to sociolinguistic studies by highlighting how different speakers use language in spontaneous contexts, reflecting their social identities, regional backgrounds, and conversational styles. This understanding can lead to more inclusive language technologies that account for the rich diversity of spoken language in Brazil, ultimately fostering better communication and interaction in various applications, from virtual assistants to educational tools.
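As a rough illustration of the disfluency analysis described above, the sketch below counts filled pauses and immediate word repetitions in a transcript. It is a toy under stated assumptions: the marker set `FILLED_PAUSES` simply reuses the "uh" and "um" examples from the text, and the function name `disfluency_counts` is hypothetical. The actual corpus transcriptions may mark filled pauses with different tokens.

```python
import re
from collections import Counter

# Assumption: placeholder markers taken from the examples in the text;
# replace with the tokens actually used in the transcriptions.
FILLED_PAUSES = {"uh", "um"}


def disfluency_counts(transcript: str) -> Counter:
    """Count filled pauses and immediate word repetitions."""
    # \w+ matches Unicode letters, so accented Portuguese words stay intact.
    tokens = re.findall(r"\w+", transcript.lower())
    counts = Counter()
    for i, token in enumerate(tokens):
        if token in FILLED_PAUSES:
            counts["filled_pause"] += 1
        elif i > 0 and token == tokens[i - 1]:
            # The same word twice in a row is treated as a repetition.
            counts["repetition"] += 1
    return counts


counts = disfluency_counts("um I I think uh we we we should go")
# Two filled pauses ("um", "uh") and three repetitions ("I", "we", "we").
```

Aggregating such counts per speaker or per recording would be one simple starting point for the sociolinguistic comparisons mentioned above.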