The paper introduces SpeechAlign, a framework for evaluating the alignment capabilities of speech translation models. It consists of two main components:
The Speech Gold Alignment dataset: This dataset extends the existing text translation gold alignment dataset by Vilar et al. (2006) to the speech domain. It includes synthetic speech generated using a text-to-speech model, along with timestamps for each word, enabling the evaluation of word-to-audio alignments.
Two novel metrics: Speech Alignment Error Rate (SAER) and Time-weighted SAER (TW-SAER). These metrics adapt the Alignment Error Rate (AER) metric from text translation to the speech domain. SAER treats each word equally, while TW-SAER assigns weights based on the duration of each word in the speech signal.
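To make the relation to AER concrete, here is a minimal Python sketch. The `aer` function follows the standard text-alignment definition, AER = 1 - (|A∩S| + |A∩P|) / (|A| + |S|), where A is the hypothesis link set, S the sure gold links, and P the possible gold links. The `time_weighted_aer` function is only an assumption about how duration weighting could work: this summary does not give the exact SAER or TW-SAER formulas, so the function names, the `(source_word, target_word)` link convention, and the choice to weight each link by the duration of its source word are all hypothetical.

```python
def aer(hyp, sure, possible):
    """Standard AER over sets of (source_word, target_word) links."""
    possible = possible | sure  # by convention, sure links are also possible
    if not hyp and not sure:
        return 0.0
    matched = len(hyp & sure) + len(hyp & possible)
    return 1.0 - matched / (len(hyp) + len(sure))


def time_weighted_aer(hyp, sure, possible, durations):
    """Hypothetical duration-weighted variant (an assumption, not the paper's formula).

    durations[i] is the spoken length, in seconds, of source word i.
    Each link counts in proportion to the duration of its source word.
    """
    possible = possible | sure

    def mass(links):
        return sum(durations[src] for src, _ in links)

    denom = mass(hyp) + mass(sure)
    if denom == 0:
        return 0.0
    return 1.0 - (mass(hyp & sure) + mass(hyp & possible)) / denom


# Toy example: three spoken source words aligned to three target words.
sure = {(0, 0), (1, 1)}
possible = {(2, 1)}
hyp = {(0, 0), (2, 1), (1, 2)}
durations = [0.2, 0.5, 0.3]  # seconds per source word

print(aer(hyp, sure, possible))
print(time_weighted_aer(hyp, sure, possible, durations))
```

With equal durations for every word, the weighted variant reduces to plain AER, which mirrors the summary's description of SAER treating each word equally.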
The paper also provides a preprocessing step to convert token-to-token contribution maps (e.g., attention weights) into word-to-word alignments, which are then used to compute the SAER and TW-SAER scores. Finally, the authors benchmark various open-source speech translation models using the SpeechAlign framework, demonstrating its utility.
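The preprocessing step can be illustrated with a rough Python sketch. The summary does not say how token contributions are pooled, so the choices here are assumptions: contributions are summed over the source tokens of each word, averaged over the target tokens of each word, and each target word is then linked to its highest-scoring source word. The names `tokens_to_words` and `hard_alignment` are hypothetical. For speech input, the source "tokens" would presumably be audio frames assigned to words through the dataset's timestamps.

```python
def tokens_to_words(contrib, src_word_ids, tgt_word_ids):
    """Pool a token-level contribution map to word level.

    contrib[t][s]   : contribution of source token s to target token t
    src_word_ids[s] : index of the word that source token s belongs to
    tgt_word_ids[t] : index of the word that target token t belongs to
    Returns word_map[target_word][source_word].
    """
    n_src = max(src_word_ids) + 1
    n_tgt = max(tgt_word_ids) + 1
    sums = [[0.0] * n_src for _ in range(n_tgt)]
    counts = [0] * n_tgt
    for t, row in enumerate(contrib):
        tw = tgt_word_ids[t]
        counts[tw] += 1
        for s, value in enumerate(row):
            # Sum over the source tokens that make up one source word.
            sums[tw][src_word_ids[s]] += value
    # Average over the target tokens that make up one target word.
    return [[v / counts[tw] for v in sums[tw]] for tw in range(n_tgt)]


def hard_alignment(word_map):
    """Link each target word to its highest-scoring source word."""
    links = set()
    for tgt, row in enumerate(word_map):
        src = max(range(len(row)), key=row.__getitem__)
        links.add((src, tgt))
    return links


# Toy example: 4 source tokens forming 3 words, 3 target tokens forming 2 words.
contrib = [
    [0.5, 0.3, 0.1, 0.1],
    [0.1, 0.1, 0.6, 0.2],
    [0.0, 0.1, 0.5, 0.4],
]
word_map = tokens_to_words(contrib, [0, 0, 1, 2], [0, 1, 1])
print(hard_alignment(word_map))
```

The resulting set of `(source_word, target_word)` links is the kind of hypothesis alignment that an AER-style score compares against the gold alignment.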
Key insights distilled from
by Bele... at arxiv.org, 04-26-2024
https://arxiv.org/pdf/2309.11585.pdf