Core Concept
Synthetic data generation and joint learning are combined to make code-mixed translation robust to error-prone, real-world text.
Summary
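The paper addresses code-mixed translation, where a single utterance blends more than one language, under two constraints: annotated data is scarce and real-world text is prone to errors. Its proposed solution combines synthetic data generation with joint learning. Three pieces are covered: HINMIX, a synthetically built Hinglish-to-English parallel corpus of ~4.2M sentence pairs; RCMT, a perturbation-based model trained for robustness; and zero-shot code-mixed translation for Bengali. The authors report that RCMT outperforms existing methods on both code-mixed and robust machine translation tasks.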
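To make the idea of a synthetically built code-mixed corpus concrete, here is a minimal sketch of one common approach: swapping some words of a romanized Hindi sentence for their English equivalents. This is an illustration only. The summary does not describe how HINMIX was actually generated, and the names `TOY_LEXICON`, `code_mix`, and `make_pair` are hypothetical.

```python
import random

# Hypothetical toy lexicon (romanized Hindi -> English); not taken from HINMIX.
TOY_LEXICON = {
    "kitab": "book",
    "pani": "water",
    "ghar": "home",
    "dost": "friend",
}


def code_mix(tokens, lexicon, switch_prob, rng):
    """Swap each token found in the lexicon for its English form
    with probability switch_prob; leave every other token unchanged."""
    mixed = []
    for tok in tokens:
        if tok in lexicon and rng.random() < switch_prob:
            mixed.append(lexicon[tok])
        else:
            mixed.append(tok)
    return mixed


def make_pair(hindi_tokens, english_sentence, lexicon, switch_prob=0.5, seed=0):
    """Build one (code-mixed source, English target) training pair."""
    rng = random.Random(seed)  # seeded so the output is reproducible
    source = " ".join(code_mix(hindi_tokens, lexicon, switch_prob, rng))
    return source, english_sentence


if __name__ == "__main__":
    tokens = "mera dost ghar par hai".split()
    print(make_pair(tokens, "my friend is at home", TOY_LEXICON, switch_prob=1.0))
```

With `switch_prob=1.0` every token that has a lexicon entry is replaced, and with `0.0` the sentence is left untouched. A realistic pipeline would need word alignment and linguistic constraints on where switches may occur, which this sketch deliberately ignores.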
- Introduction
  - Online communication features code-mixing.
  - Scarcity of annotated data poses challenges.
  - Real-world text is prone to errors.
- Related Work
  - Previous studies on code-mixing and intrasentential code-switching.
  - Various methods proposed for synthetic code-mixed (CM) data generation.
- Robust Code-Mixed Translation (RCMT)
  - RCMT is formulated within a joint learning framework.
  - The model is trained on both clean and noisy CM text for robustness.
- Experiments and Results
  - Comparative results show RCMT outperforming the baselines.
  - Evaluation on different datasets shows the method's effectiveness.
- Zero-shot Code-mixed MT (ZCMT)
  - The model is trained on Bengali-English and Hindi-English corpora.
  - It is tested on Bengali CM translation, with promising results.
- Conclusion
  - A strategy is proposed for translating real-world code-mixed sentences effectively.
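The outline mentions training on both clean and noisy CM text. The sketch below shows one way such training data could be prepared: each clean source sentence is kept and paired with a character-level perturbed copy that shares the same target. This is an assumption-laden illustration, not the paper's perturbation scheme or its joint learning objective, neither of which is described in this summary. The function names and the three edit operations (drop, swap, repeat) are hypothetical.

```python
import random


def perturb_word(word, rng):
    """Apply one random character-level edit (drop, swap, or repeat).
    Words shorter than 3 characters are returned unchanged."""
    if len(word) < 3:
        return word
    i = rng.randrange(1, len(word) - 1)  # never touches the first character
    op = rng.choice(("drop", "swap", "repeat"))
    if op == "drop":
        return word[:i] + word[i + 1:]
    if op == "swap":
        return word[:i] + word[i + 1] + word[i] + word[i + 2:]
    return word[:i] + word[i] + word[i:]  # repeat the character at position i


def perturb_sentence(sentence, noise_prob, rng):
    """Perturb each whitespace-separated word with probability noise_prob."""
    return " ".join(
        perturb_word(w, rng) if rng.random() < noise_prob else w
        for w in sentence.split()
    )


def build_joint_examples(pairs, noise_prob=0.3, seed=0):
    """For every (source, target) pair, emit the clean example and a noisy
    copy with the same target, so a model sees both during training."""
    rng = random.Random(seed)
    examples = []
    for src, tgt in pairs:
        examples.append({"source": src, "target": tgt, "noisy": False})
        examples.append(
            {"source": perturb_sentence(src, noise_prob, rng), "target": tgt, "noisy": True}
        )
    return examples


if __name__ == "__main__":
    pairs = [("mera friend home par hai", "my friend is at home")]
    for example in build_joint_examples(pairs, noise_prob=1.0):
        print(example)
```

Sharing the target between the clean and noisy copies is the design point here: the model is pushed to map both spellings to the same translation. How the two kinds of examples are weighted or combined in a loss is left open.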
Statistics
"First, we synthetically develop HINMIX, a parallel corpus of Hinglish to English, with ~4.2M sentence pairs."
"Our evaluation demonstrates the superiority of RCMT over state-of-the-art methods."
Quotes
"The widespread online communication in a modern multilingual world has provided opportunities to blend more than one language (aka code-mixed language) in a single utterance."
"A potential solution to mitigate the data scarcity problem in low-resource setup is to leverage existing data in resource-rich language through translation."