The paper presents a new corpus of Mandarin-English conversational telephone speech, which consists of 123.5 hours of data from the CallHome Mandarin Chinese Speech and HKUST Mandarin Telephone Speech datasets. The corpus is divided into train, development, and test sets.
The primary contribution of the paper is the provision of English translations for the Mandarin speech data, enabling the corpus to be used for building speech translation systems. The translations were produced by Mandarin-English bilingual annotators through Appen, with multiple iterations of feedback and quality assurance.
The authors demonstrate the importance of domain-specific, matched training data for building conversational speech translation systems. They present results from cascade speech translation systems, in which the output of an Automatic Speech Recognition (ASR) system is used as input to a Machine Translation (MT) system. Fine-tuning a general-purpose translation model (NLLB) on the Mandarin-English conversational telephone speech training set improves the BLEU score by more than 8 points, which highlights the critical role of in-domain data in achieving high-quality speech translation.
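The cascade arrangement described above can be sketched in a few lines. This is a minimal illustration and not the authors' implementation: the function name `cascade_translate`, the type aliases, and the toy `toy_asr` and `toy_mt` stand-ins are all hypothetical, chosen only to show how the ASR transcript becomes the MT input.

```python
from typing import Callable, List

# Hypothetical interfaces (assumptions, not from the paper):
# an ASR component maps raw audio to a Mandarin transcript,
# an MT component maps Mandarin text to English text.
ASR = Callable[[bytes], str]
MT = Callable[[str], str]


def cascade_translate(audio_segments: List[bytes], asr: ASR, mt: MT) -> List[str]:
    """Translate each segment by running ASR first, then MT on the transcript."""
    translations = []
    for audio in audio_segments:
        transcript = asr(audio)              # stage 1: speech -> Mandarin text
        translations.append(mt(transcript))  # stage 2: Mandarin text -> English text
    return translations


# Toy stand-ins so the sketch runs without any models.
def toy_asr(audio: bytes) -> str:
    # Pretends the "audio" bytes are already UTF-8 encoded text.
    return audio.decode("utf-8")


def toy_mt(text: str) -> str:
    # Tiny lookup table in place of a real translation model.
    lookup = {"你好": "hello", "谢谢": "thank you"}
    return lookup.get(text, "<unk>")


segments = ["你好".encode("utf-8"), "谢谢".encode("utf-8")]
outputs = cascade_translate(segments, toy_asr, toy_mt)
print(outputs)
```

In a real system the two stand-ins would be replaced by a trained ASR model and by an MT model such as NLLB, either off the shelf or fine-tuned on the in-domain training set. Because the MT stage only ever sees the ASR output, recognition errors propagate into the translation, which is one reason matched conversational training data matters for this setup.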
The authors conclude that the new corpus introduced in this paper provides a valuable resource for the research and development of conversational speech translation systems, addressing a critical gap in available resources.
Key insights distilled from
by Shannon Woth... : arxiv.org 04-19-2024
https://arxiv.org/pdf/2404.11619.pdf