This paper explores the use of grammatical error correction (GEC) systems on code-switched (CSW) text, where multilingual speakers use a combination of languages in a single discourse or utterance. The authors note that most existing GEC systems have been trained on monolingual data and are not developed with CSW in mind, leading to poor performance on such text.
To address this, the authors propose a novel method of generating synthetic CSW GEC datasets by translating selected spans of text within existing GEC corpora. They investigate several strategies for selecting these spans, based on CSW ratio, switch-point factor, and linguistic constraints, and evaluate how each selection strategy affects the performance of GEC systems on CSW text.
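To make the augmentation idea concrete, here is a minimal sketch in Python. It is not the authors' implementation: `make_csw_source`, `toy_translate`, and `TOY_LEXICON` are illustrative names, the dictionary lookup merely stands in for the machine translation step the paper relies on, and selecting a contiguous span sized to hit a target CSW ratio is just one plausible reading of their ratio-based strategy.

```python
import random

def make_csw_source(tokens, translate, csw_ratio=0.3, rng=random):
    """Translate one contiguous span of `tokens` so that roughly
    `csw_ratio` of the sentence ends up in the embedded language."""
    if not tokens:
        return tokens
    n = len(tokens)
    span_len = max(1, round(n * csw_ratio))
    start = rng.randrange(0, n - span_len + 1)
    translated = translate(tokens[start:start + span_len])
    return tokens[:start] + translated + tokens[start + span_len:]

# Toy stand-in for a real MT system (hypothetical; the paper translates
# into languages such as Chinese, Korean, and Japanese).
TOY_LEXICON = {"apple": "苹果", "red": "红色的", "the": "那个"}

def toy_translate(span):
    return [TOY_LEXICON.get(tok.lower(), tok) for tok in span]

if __name__ == "__main__":
    src = "I ate the red apple yesterday".split()
    print(" ".join(make_csw_source(src, toy_translate, csw_ratio=0.3)))
```

Applying the same transformation to the source side of a GEC corpus while keeping the original corrections would yield synthetic CSW GEC training pairs.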
The authors' best model achieves an average improvement of 1.57 F0.5 across three CSW test sets (English-Chinese, English-Korean, and English-Japanese) without degrading performance on a monolingual dataset. They also find that models trained on one CSW language pair generalize relatively well to other typologically similar CSW language pairs.
The key highlights and insights from the paper are:

- Data augmentation is most effective in the context of multilingual pre-trained models.
- The most linguistically motivated selection method, replacing a random noun token with its translation, yielded the best improvement over both the other methods and the baseline (see the sketch after this list).
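As a rough illustration of that noun-replacement constraint, the sketch below builds on the toy translator above. Using `nltk.pos_tag` with an `NN`-prefix check is an assumption about one plausible way to identify noun tokens, not the authors' actual pipeline.

```python
import random
import nltk  # requires: nltk.download("averaged_perceptron_tagger")

def replace_random_noun(tokens, translate, rng=random):
    """Translate a single randomly chosen noun token, leaving the
    rest of the sentence in the matrix language."""
    tagged = nltk.pos_tag(tokens)
    noun_idx = [i for i, (_, tag) in enumerate(tagged) if tag.startswith("NN")]
    if not noun_idx:
        return tokens  # no noun to switch; leave the sentence monolingual
    i = rng.choice(noun_idx)
    out = list(tokens)
    out[i] = translate([tokens[i]])[0]
    return out

if __name__ == "__main__":
    sent = "the cat sat on the mat".split()
    # toy single-word "translator" (hypothetical stand-in for real MT)
    zh = {"cat": "猫", "mat": "垫子"}
    print(" ".join(replace_random_noun(sent, lambda s: [zh.get(w, w) for w in s])))
```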
Key ideas extracted from https://arxiv.org/pdf/2404.12489.pdf (arxiv.org, 04-22-2024)