Essential Concepts
Leveraging the complementary strengths of large language models (LLMs) and supervised machine translation (MT) systems, this work explores strategies to guide LLaMA-2 models to improve MT outputs using external feedback on translation errors.
Summary
The paper explores techniques to guide LLaMA-2 language models to improve machine translation (MT) outputs using external feedback on translation errors. The authors consider three levels of feedback granularity: generic, score-based, and fine-grained error annotations.
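To make the three granularities concrete, the sketch below shows one plausible way to render each feedback level as a prompt string. The wording, the German-English example, and the MQM-style annotation are illustrative assumptions, not the paper's exact templates.

```python
# Illustrative zero-shot prompts for the three feedback granularities.
# The wording and the German-English example are hypothetical, not the
# paper's exact templates.

source = "Das Wetter ist heute schön."
mt_output = "The weather is nice this day."

generic_feedback = "The translation contains errors. Improve it."
score_feedback = "The translation received an MQM quality score of -5. Improve it."
fine_grained_feedback = (
    "Error span: 'this day'; category: fluency/grammar; severity: minor. "
    "Fix the annotated errors."
)

def build_prompt(src: str, hyp: str, feedback: str) -> str:
    """Assemble a zero-shot post-editing prompt from source, MT output, and feedback."""
    return (
        f"Source (German): {src}\n"
        f"MT output (English): {hyp}\n"
        f"Feedback: {feedback}\n"
        f"Improved translation (English):"
    )

print(build_prompt(source, mt_output, fine_grained_feedback))
```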
For the prompting experiments:
- Zero-shot prompting with any form of feedback leads to marginal improvements in translation quality metrics like BLEU, TER, and COMET.
- 10-shot prompting widens the performance gap between the original and post-edited MT, with consistent gains in BLEU, TER, and COMET scores.
- The performance gap between the smaller 7B and the larger 13B LLaMA-2 models narrows as the number of few-shot examples increases, suggesting that few-shot learning helps bridge the size gap for MT post-editing.
- The different levels of feedback granularity perform similarly in the 10-shot setting, with fine-grained feedback providing no clear advantage over generic feedback (see the prompt-assembly sketch after this list).
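As a concrete illustration of the few-shot setup, here is a minimal sketch of how a 10-shot post-editing prompt could be assembled. The demonstration fields (source, mt, feedback, post_edit) and the formatting are assumptions, not the paper's released templates.

```python
# Sketch: assembling a k-shot post-editing prompt from demonstration examples.
# The field names and layout are assumed, not taken from the paper's code.

from typing import Dict, List

def format_example(ex: Dict[str, str], include_answer: bool = True) -> str:
    """Render one (source, MT, feedback[, post-edit]) example as a prompt block."""
    block = (
        f"Source: {ex['source']}\n"
        f"MT output: {ex['mt']}\n"
        f"Feedback: {ex['feedback']}\n"
        f"Improved translation:"
    )
    if include_answer:
        block += f" {ex['post_edit']}"
    return block

def build_few_shot_prompt(demos: List[Dict[str, str]],
                          test: Dict[str, str],
                          k: int = 10) -> str:
    """Concatenate k solved demonstrations followed by the unsolved test instance."""
    shots = [format_example(d) for d in demos[:k]]
    query = format_example(test, include_answer=False)
    return "\n\n".join(shots + [query])

# Usage (with a hypothetical pool of annotated demonstrations):
# prompt = build_few_shot_prompt(demo_pool, test_instance, k=10)
```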
For the fine-tuning experiments:
- Fine-tuning the LLaMA-2 models with error-annotated translations leads to significant improvements in translation quality over the original MT, outperforming the best prompting results.
- The multilingual fine-tuning approach, which combines three language pairs, generally outperforms bilingual fine-tuning.
- Human evaluation confirms that fine-tuning not only resolves the targeted errors but also generates more natural translations in the target language.
The analysis reveals that fine-tuning helps the LLMs effectively integrate the provided fine-grained feedback to address the specific errors in the initial translation.
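As an illustration of how such error-annotated instances could be serialized for instruction tuning, here is a minimal sketch; the field names, instruction wording, and the Chinese-English example are assumptions rather than the authors' actual data format.

```python
# Sketch: turning one MQM-annotated translation into a supervised
# fine-tuning record (prompt -> target post-edit).
# Field names, instruction wording, and the example are assumptions.

import json

def to_finetune_record(src, mt, mqm_spans, post_edit, lang_pair="Chinese-English"):
    """Build one instruction-tuning example for post-editing with fine-grained feedback."""
    feedback = "; ".join(
        f"'{span['span']}' ({span['category']}, {span['severity']})" for span in mqm_spans
    )
    prompt = (
        f"Improve the {lang_pair} translation using the error annotations.\n"
        f"Source: {src}\n"
        f"MT output: {mt}\n"
        f"Errors: {feedback}\n"
        f"Improved translation:"
    )
    return {"prompt": prompt, "completion": " " + post_edit}

record = to_finetune_record(
    src="他今天很高兴。",
    mt="He is very happiness today.",
    mqm_spans=[{"span": "very happiness", "category": "fluency/grammar", "severity": "minor"}],
    post_edit="He is very happy today.",
)
print(json.dumps(record, ensure_ascii=False, indent=2))
```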
Statistics
The original MT outputs have an average BLEU of 0.45, TER of 0.81, and COMET of 0.71 across the three language pairs.
Zero-shot prompting with any form of feedback yields marginal improvements of around 0.01-0.02 BLEU, 0.03-0.06 TER, and 0.01-0.02 COMET.
10-shot prompting achieves an average improvement of 0.04 BLEU, 0.03 COMET, and 0.04 TER over the original MT.
The fine-tuned models show an average improvement of 0.07 BLEU, 0.08 COMET, and 0.21 TER over the original MT.
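For reference, BLEU and TER numbers on this 0-1 scale can be reproduced with sacrebleu as sketched below (COMET additionally requires the Unbabel comet package); the toy sentences are placeholders.

```python
# Sketch: scoring post-edited outputs with sacrebleu's BLEU and TER.
# The sentence pairs are placeholders; replace them with real post-edits
# and references.

import sacrebleu

hypotheses = ["The weather is nice today.", "He is very happy today."]
references = ["The weather is nice today.", "He is happy today."]

bleu = sacrebleu.corpus_bleu(hypotheses, [references])
ter = sacrebleu.corpus_ter(hypotheses, [references])

# sacrebleu reports scores on a 0-100 scale; divide by 100 to match the
# 0-1 scale used in the statistics above.
print(f"BLEU: {bleu.score / 100:.2f}")
print(f"TER:  {ter.score / 100:.2f}")
```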
Quotes
"Leveraging the complementary strengths of LLMs and supervised MT by guiding LLMs to automatically post-edit MT with external feedback on its quality, derived from Multidimensional Quality Metric (MQM) annotations."
"Fine-tuning helps integrate fine-grained feedback more effectively and further improves translation quality based on both automatic and human evaluation."