The paper introduces SpeechAlign, a framework for evaluating the alignment capabilities of speech translation models. It consists of two main components:
The Speech Gold Alignment dataset: This dataset extends the existing text translation gold alignment dataset by Vilar et al. (2006) to the speech domain. It includes synthetic speech generated using a text-to-speech model, along with timestamps for each word, enabling the evaluation of word-to-audio alignments.
Two novel metrics: Speech Alignment Error Rate (SAER) and Time-weighted SAER (TW-SAER). These metrics adapt the Alignment Error Rate (AER) metric from text translation to the speech domain. SAER treats each word equally, while TW-SAER assigns weights based on the duration of each word in the speech signal.
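To make the relationship between these metrics concrete, the sketch below computes the standard text-domain AER, defined as 1 - (|A∩S| + |A∩P|) / (|A| + |S|) for predicted links A, sure links S, and possible links P. It then adds a duration-weighted variant. The weighted function is only an illustration of the idea behind TW-SAER, assuming that each link is weighted by the duration of its spoken source word; the paper's exact formulation is not given in this summary, and the names `aer`, `time_weighted_aer`, and `durations` are hypothetical.

```python
def aer(hyp, sure, possible):
    """Standard Alignment Error Rate over sets of (source_word, target_word) links."""
    possible = possible | sure  # by convention, sure links are also possible links
    denom = len(hyp) + len(sure)
    if denom == 0:
        return 0.0
    return 1.0 - (len(hyp & sure) + len(hyp & possible)) / denom


def time_weighted_aer(hyp, sure, possible, durations):
    """Illustrative duration-weighted AER (an assumption, not the paper's definition).

    durations[i] is the duration in seconds of spoken source word i;
    each link counts in proportion to the duration of its source word.
    """
    possible = possible | sure

    def weight(links):
        return sum(durations[src] for src, _ in links)

    denom = weight(hyp) + weight(sure)
    if denom == 0:
        return 0.0
    return 1.0 - (weight(hyp & sure) + weight(hyp & possible)) / denom


if __name__ == "__main__":
    hyp = {(0, 0), (1, 1), (2, 1)}   # predicted links; (2, 1) is wrong
    sure = {(0, 0), (1, 1)}
    possible = {(2, 2)}
    durations = [0.1, 0.2, 0.7]      # the misaligned source word is the longest
    print(aer(hyp, sure, possible))
    print(time_weighted_aer(hyp, sure, possible, durations))
```

In this toy case the wrongly aligned word is also the longest one, so the weighted score penalizes the error more heavily than the unweighted score does. With equal durations the two functions agree.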
The paper also provides a preprocessing step to convert token-to-token contribution maps (e.g., attention weights) into word-to-word alignments, which are then used to compute the SAER and TW-SAER scores. Finally, the authors benchmark various open-source speech translation models using the SpeechAlign framework, demonstrating its utility.
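The preprocessing step can be pictured as follows. This is a minimal sketch, not the paper's procedure: it assumes contributions are summed over all tokens belonging to the same word pair, and that each target word is then linked to its highest-scoring source word. In the speech setting the source "tokens" would be audio frames grouped into words via the dataset's timestamps. The function names and the aggregation rule are hypothetical.

```python
def tokens_to_words(contrib, tgt_word_of_token, src_word_of_token):
    """Aggregate a token-level contribution map into a word-level map.

    contrib[t][s] is the contribution of source token s to target token t.
    tgt_word_of_token[t] and src_word_of_token[s] give the word index of each token.
    Aggregation by summation is an assumption made for illustration.
    """
    n_tgt = max(tgt_word_of_token) + 1
    n_src = max(src_word_of_token) + 1
    word_map = [[0.0] * n_src for _ in range(n_tgt)]
    for t, row in enumerate(contrib):
        for s, value in enumerate(row):
            word_map[tgt_word_of_token[t]][src_word_of_token[s]] += value
    return word_map


def hard_alignment(word_map):
    """Link each target word to its highest-scoring source word.

    Returns a set of (source_word, target_word) links.
    """
    links = set()
    for tgt, row in enumerate(word_map):
        best_src = max(range(len(row)), key=row.__getitem__)
        links.add((best_src, tgt))
    return links


if __name__ == "__main__":
    # 3 target tokens forming 2 words; 4 source tokens (or frames) forming 3 words.
    contrib = [
        [0.7, 0.1, 0.1, 0.1],
        [0.6, 0.2, 0.1, 0.1],
        [0.1, 0.1, 0.2, 0.6],
    ]
    word_map = tokens_to_words(contrib, [0, 0, 1], [0, 1, 1, 2])
    print(hard_alignment(word_map))
```

The resulting link set has the same shape as a gold alignment, so it can be scored directly with an AER-style metric.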
arxiv.org