This study develops simplified annotation guidelines and an African-centric multilingual language model to create robust machine translation evaluation metrics for a diverse set of under-resourced African languages.
When fine-tuned on translation evaluation data, large language models show promise for reference-less translation evaluation, achieving correlations with human judgments that are competitive with or superior to existing reference-less methods such as COMET-QE.
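How closely a metric tracks human judgments is typically quantified with segment-level rank correlations such as Kendall's tau or Spearman's rho. A minimal sketch of that comparison, assuming scipy is available; the scores and variable names below are made-up placeholders, not data from the study:

```python
from scipy.stats import kendalltau, spearmanr

# Illustrative per-segment scores from a learned metric and the
# corresponding human judgments (e.g., direct-assessment scores).
metric_scores = [0.81, 0.42, 0.67, 0.90, 0.35, 0.58]
human_scores = [78, 40, 70, 95, 30, 55]

tau, tau_p = kendalltau(metric_scores, human_scores)
rho, rho_p = spearmanr(metric_scores, human_scores)

print(f"Kendall tau:  {tau:.3f} (p={tau_p:.3f})")
print(f"Spearman rho: {rho:.3f} (p={rho_p:.3f})")
```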
Source context can effectively substitute for a reference in machine translation evaluation, with source-based metrics outperforming reference-based ones in many settings.
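One way to operationalise this is a quality-estimation-style metric that scores (source, hypothesis) pairs with no reference at all. A small sketch using the unbabel-comet library; the checkpoint name and the exact fields of the returned prediction object are assumptions that may vary across library versions:

```python
from comet import download_model, load_from_checkpoint

# Reference-free (QE-style) checkpoint; the name is illustrative and the
# model may require accepting a license on the Hugging Face Hub.
model_path = download_model("Unbabel/wmt22-cometkiwi-da")
model = load_from_checkpoint(model_path)

# Note: no "ref" key -- the source sentence stands in for the reference.
data = [
    {"src": "source sentence in the source language",
     "mt": "machine translation to be evaluated"},
]

output = model.predict(data, batch_size=8, gpus=0)
print(output.scores)        # per-segment scores
print(output.system_score)  # corpus-level average
```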
Stable and replicable rankings of natural language generation systems can be achieved through careful design of human evaluation methodologies.
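One simple way to probe ranking stability is to bootstrap-resample the collected human judgments and check how often the induced system ranking changes. A hypothetical sketch with made-up scores; the system names, numbers, and resampling scheme are illustrative only, not the evaluation protocol used in the study:

```python
import random
from collections import Counter

# Made-up per-segment human scores for three hypothetical systems.
judgments = {
    "system_a": [72, 80, 65, 90, 70, 85, 60, 75],
    "system_b": [70, 78, 68, 88, 72, 80, 62, 74],
    "system_c": [55, 60, 50, 70, 58, 65, 48, 59],
}

def bootstrap_rankings(judgments, n_resamples=1000, seed=0):
    """Resample segments with replacement and count each observed ranking."""
    rng = random.Random(seed)
    n = len(next(iter(judgments.values())))
    rankings = Counter()
    for _ in range(n_resamples):
        idx = [rng.randrange(n) for _ in range(n)]
        means = {s: sum(scores[i] for i in idx) / n
                 for s, scores in judgments.items()}
        rankings[tuple(sorted(means, key=means.get, reverse=True))] += 1
    return rankings

# A ranking that appears in nearly all resamples is stable; a split
# distribution signals that the evaluation cannot separate the systems.
for ranking, count in bootstrap_rankings(judgments).most_common():
    print(ranking, f"{count / 1000:.1%}")
```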