Key Concepts
We introduce MEDIT, a set of multilingual models built by fine-tuning large pre-trained language models via instruction tuning. The models perform text editing tasks such as grammatical error correction, text simplification, and paraphrasing across multiple languages.
Summary
The authors introduce MEDIT, a multilingual extension of the COEDIT text editing models. MEDIT models are trained by fine-tuning multilingual large pre-trained language models (LLMs) via instruction tuning. They are designed to take natural language instructions from the user specifying the desired text attributes, such as "Grammatik korrigieren" (German, "correct the grammar") or "이 텍스트를 단순화" (Korean, "simplify this text").
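Instruction-tuned editing models of this kind are typically trained on pairs where a natural-language instruction is prepended to the source text. The sketch below illustrates this common formatting pattern; the `make_example` helper and the sample sentences are illustrative assumptions, not taken from the MEDIT paper.

```python
# Hypothetical sketch of instruction-tuning pair construction for a
# text editing model. The exact prompt template used by MEDIT is an
# assumption here; only the general pattern (instruction + source ->
# target) is described in the source text.
def make_example(instruction: str, source: str, target: str) -> dict:
    """Prepend the editing instruction to the source text."""
    return {
        "input": f"{instruction}: {source}",
        "output": target,
    }

pairs = [
    # German GEC instruction ("correct the grammar")
    make_example("Grammatik korrigieren",
                 "Das ist ein falsch Satz.",
                 "Das ist ein falscher Satz."),
    # English simplification instruction
    make_example("Simplify the text",
                 "The feline reposed upon the rug.",
                 "The cat lay on the rug."),
]

for p in pairs:
    print(p["input"], "->", p["output"])
```

Because the task is expressed in the input text itself, the same fine-tuned model can switch between editing tasks and languages at inference time simply by changing the instruction.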
The authors build MEDIT by curating data from multiple publicly available human-annotated text editing datasets for three tasks (Grammatical Error Correction, Text Simplification, and Paraphrasing) across seven languages from six different language families. They evaluate the performance of MEDIT models on various multilingual text editing benchmarks and find that they generalize effectively to new languages compared to multilingual baselines.
The key highlights are:
- MEDIT models can perform multilingual and cross-lingual text editing across three tasks (GEC, Simplification, Paraphrasing) in seven languages.
- MEDIT models outperform various multilingual LLMs on text editing tasks, especially when fine-tuned on task-specific datasets.
- MEDIT models show strong generalization to new languages not seen during training.
- The authors provide detailed analyses on the impact of model architecture, scale, and task-specific data on performance.
- Human evaluations confirm the quality of MEDIT model outputs across fluency, adequacy, and accuracy.
Statistics
The training dataset consists of over 200k instructional input-output pairs across the three text editing tasks and seven languages.
The authors curated data from multiple publicly available human-annotated datasets.
The amount of training data varies significantly across languages and tasks, with English having the largest dataset.
Quotes
"MEDIT models can perform text editing operations for three popular tasks: Grammatical Error Correction, Paraphrasing, and Text Simplification, in multilingual and cross-lingual settings across a diverse set of seven languages, spanning six different language families."
"We evaluate the performance of our models extensively on text editing benchmarks in both multilingual and cross-lingual settings to demonstrate their effectiveness."
"Through a comprehensive set of controlled experiments, we provide insights on how model performance on multilingual text editing tasks is affected by various choices like model architecture, model scale, and training data mixtures."