The authors aim to address the shortage of simultaneous interpretation (SI) data for training and evaluating simultaneous machine translation (SiMT) systems. Starting from the non-parallel NAIST-SIC corpus, they propose a two-stage alignment approach that produces a parallel SI dataset, NAIST-SIC-Aligned.
The first stage is coarse alignment, which involves identifying minimal groups of source and target sentences that are considered translations of each other. The second stage is fine-grained alignment, where intra- and inter-sentence filtering techniques are applied over the coarse-aligned pairs to improve data quality. Each step is validated either manually or automatically to ensure the quality of the final corpus.
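The two stages can be illustrated with a small sketch. This is not the authors' implementation: the dynamic-programming grouping, the function names `coarse_align` and `filter_pairs`, the `max_group` limit, the `threshold` value, and the token-overlap scorer `jaccard` are all assumptions chosen to keep the example self-contained. In particular, token overlap only works when both sides share vocabulary, so a real language pair would need a cross-lingual similarity scorer in its place, and the filter shown here drops whole pairs rather than editing inside a sentence.

```python
from typing import Callable, List, Tuple

# A group pairs consecutive source sentence indices with consecutive target indices.
Group = Tuple[List[int], List[int]]


def jaccard(a: str, b: str) -> float:
    """Toy similarity: token overlap. Stand-in for a cross-lingual scorer (assumption)."""
    ta, tb = set(a.lower().split()), set(b.lower().split())
    return len(ta & tb) / len(ta | tb) if ta | tb else 0.0


def coarse_align(
    src: List[str],
    tgt: List[str],
    sim: Callable[[str, str], float] = jaccard,
    max_group: int = 2,
) -> List[Group]:
    """Split src and tgt into consecutive m-to-n groups (1 <= m, n <= max_group)
    in monotonic order, maximizing the summed similarity of the groups."""
    n_s, n_t = len(src), len(tgt)
    neg = float("-inf")
    best = [[neg] * (n_t + 1) for _ in range(n_s + 1)]
    back = [[None] * (n_t + 1) for _ in range(n_s + 1)]
    best[0][0] = 0.0
    for i in range(1, n_s + 1):
        for j in range(1, n_t + 1):
            for m in range(1, min(max_group, i) + 1):
                for n in range(1, min(max_group, j) + 1):
                    prev = best[i - m][j - n]
                    if prev == neg:
                        continue  # prefix cannot be grouped, skip
                    score = prev + sim(" ".join(src[i - m:i]), " ".join(tgt[j - n:j]))
                    if score > best[i][j]:
                        best[i][j] = score
                        back[i][j] = (m, n)
    if best[n_s][n_t] == neg:
        raise ValueError("no monotonic grouping exists under max_group")
    # Walk the back-pointers from the end to recover the groups.
    groups: List[Group] = []
    i, j = n_s, n_t
    while i > 0 and j > 0:
        m, n = back[i][j]
        groups.append((list(range(i - m, i)), list(range(j - n, j))))
        i, j = i - m, j - n
    return groups[::-1]


def filter_pairs(
    src: List[str],
    tgt: List[str],
    groups: List[Group],
    sim: Callable[[str, str], float] = jaccard,
    threshold: float = 0.5,
) -> List[Tuple[str, str]]:
    """Pair-level filter: keep only groups whose similarity reaches the threshold."""
    kept = []
    for s_idx, t_idx in groups:
        s = " ".join(src[k] for k in s_idx)
        t = " ".join(tgt[k] for k in t_idx)
        if sim(s, t) >= threshold:
            kept.append((s, t))
    return kept
```

For example, with `src = ["a b", "c d", "e f"]` and `tgt = ["a b c d", "e f"]`, `coarse_align` merges the first two source sentences into one group because that grouping has the highest total similarity, and `filter_pairs` then keeps both resulting pairs.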
The authors also compile a small-scale, manually curated SI test set for evaluation purposes. They summarize the alignment challenges and findings to guide future SI corpus construction for other language pairs. Finally, they build SiMT systems based on their corpus and show significant improvement over baselines in both translation quality and latency.
Key insights from the paper by Jinming Zhao... at arxiv.org, 04-02-2024
https://arxiv.org/pdf/2304.11766.pdf