Key Concepts
KazParC is a large-scale parallel corpus designed to support machine translation among Kazakh, English, Russian, and Turkish. The corpus was developed with the assistance of human translators and contains over 371,000 parallel sentences spanning diverse domains. The research also introduces Tilmash, a neural machine translation (NMT) model whose performance is competitive with industry-leading services such as Google Translate and Yandex Translate.
Summary
The paper introduces KazParC, a parallel corpus for machine translation among Kazakh, English, Russian, and Turkish. Key highlights:
- KazParC is the first and largest publicly available parallel corpus of its kind, containing 371,902 parallel sentences covering various domains.
- The corpus was developed with the help of human translators to ensure quality and alignment across the language pairs.
- The authors also developed an NMT model called Tilmash, which was trained on KazParC and a synthetic corpus.
- Tilmash performs on par with, and in some cases better than, industry-leading machine translation services such as Google Translate and Yandex Translate, as measured by BLEU and chrF scores.
- The corpus and Tilmash model are publicly available under a Creative Commons license.
- Tilmash performs strongly on the legal-document and general-domain texts in KazParC, but has some difficulty with idiomatic expressions and pronouns.
- The inclusion of synthetic data in addition to the human-translated corpus appears to enhance the model's versatility and translation quality across diverse domains.
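To give a rough sense of what a chrF score measures, the following is a minimal sketch of a character n-gram F-score. It is a simplified illustration, not the evaluation code used in the paper: the function names `char_ngrams` and `simple_chrf`, the whitespace handling, and the defaults (n-grams up to length 6, beta = 2) are assumptions made here, and the example strings are toy inputs.

```python
from collections import Counter


def char_ngrams(text, n):
    """Count character n-grams of length n (spaces removed; an assumption of this sketch)."""
    s = text.replace(" ", "")
    return Counter(s[i:i + n] for i in range(len(s) - n + 1))


def simple_chrf(hypothesis, reference, max_n=6, beta=2.0):
    """Simplified chrF-style score in [0, 1]; higher means closer to the reference."""
    precisions, recalls = [], []
    for n in range(1, max_n + 1):
        hyp = char_ngrams(hypothesis, n)
        ref = char_ngrams(reference, n)
        if not hyp or not ref:
            # Skip orders longer than either string
            continue
        # Clipped overlap: each n-gram counts at most as often as it occurs in both
        overlap = sum((hyp & ref).values())
        precisions.append(overlap / sum(hyp.values()))
        recalls.append(overlap / sum(ref.values()))
    if not precisions:
        return 0.0
    p = sum(precisions) / len(precisions)
    r = sum(recalls) / len(recalls)
    if p + r == 0:
        return 0.0
    # F-beta: beta > 1 weights recall more heavily than precision
    return (1 + beta ** 2) * p * r / (beta ** 2 * p + r)


if __name__ == "__main__":
    # Toy strings, not taken from KazParC
    print(simple_chrf("the court approved the law", "the court approved the law"))
    print(simple_chrf("the court approves a law", "the court approved the law"))
```

Because it works on characters rather than whole words, a score of this kind gives partial credit for near-miss word forms, which tends to matter for morphologically rich languages such as Kazakh and Turkish.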
Statistics
KazParC contains 371,902 parallel sentences in Kazakh, English, Russian, and Turkish.
The corpus covers five broad domains: Mass media (120,547 lines), General (94,988 lines), Legal documents (77,183 lines), Education and science (46,252 lines), and Fiction (32,932 lines).
The synthetic corpus (SynC) contains 1,797,066 sentences automatically translated from English to Kazakh, Russian, and Turkish using Google Translate.
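The figures above can be cross-checked with a few lines of arithmetic. The sketch below uses only the counts quoted in this section; the variable names are illustrative. It confirms that the five domain counts add up to the reported corpus size, then derives each domain's share and the size of SynC relative to KazParC.

```python
# Line counts per domain, as quoted above
domains = {
    "Mass media": 120_547,
    "General": 94_988,
    "Legal documents": 77_183,
    "Education and science": 46_252,
    "Fiction": 32_932,
}

total = sum(domains.values())
# The domain counts should add up to the reported corpus size
assert total == 371_902

# Share of each domain as a percentage of the whole corpus
shares = {name: 100 * count / total for name, count in domains.items()}

# Size of the synthetic corpus (SynC) relative to KazParC
synthetic = 1_797_066
ratio = synthetic / total

for name, share in shares.items():
    print(f"{name}: {share:.1f}%")
print(f"SynC is about {ratio:.2f} times the size of KazParC")
```

Mass media is the largest domain at roughly a third of the corpus, and the synthetic corpus is nearly five times the size of the human-translated one.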
Quotes
"KazParC is the first and largest publicly available corpus of its kind, containing a collection of 371,902 parallel sentences covering different domains and developed with the assistance of human translators."
"Remarkably, the performance of Tilmash is on par with, and in certain instances, surpasses that of industry giants, such as Google Translate and Yandex Translate, as measured by standard evaluation metrics, such as BLEU and chrF."