Core Concepts
The authors argue for evaluating multimodal machine translation (MMT) models with a framework that measures both their use of visual information and their ability to translate complex, ambiguous sentences. They propose a new evaluation method to address the limitations of current practice, which relies almost exclusively on the Multi30k benchmark.
Abstract
Multimodal machine translation (MMT) models are typically evaluated against the Multi30k dataset, but this benchmark alone may not accurately measure their performance. The authors propose an evaluation framework that combines CoMMuTE, the WMT news translation task test sets, and the Multi30k test sets. MMT models trained solely on Multi30k perform poorly when compared with strong text-only translation models. The study highlights the importance of evaluating MMT models on both their use of visual information and their ability to translate complex sentences.
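As a rough illustration of such a multi-benchmark framework, the sketch below scores one model on several test sets. Only sacrebleu's corpus_bleu call is a real API; the model's translate method and the benchmark dictionary are hypothetical stand-ins, since the paper's own evaluation code is not reproduced here.

```python
# Minimal sketch of a multi-benchmark MMT evaluation loop, assuming a
# hypothetical model with a `translate(source, image)` method. Only
# sacrebleu's corpus_bleu API is real; benchmark loading is stubbed out.
import sacrebleu

def evaluate_mmt(model, benchmarks):
    """benchmarks maps a name (e.g. 'multi30k-test2016', 'newstest2019')
    to (sources, images, references); images may be None for text-only
    test sets such as the WMT news translation sets."""
    scores = {}
    for name, (sources, images, references) in benchmarks.items():
        if images is None:
            images = [None] * len(sources)
        hypotheses = [model.translate(src, img) for src, img in zip(sources, images)]
        # corpus_bleu takes hypothesis strings plus a list of reference
        # streams; its default metric is 4-gram BLEU (BLEU4).
        scores[name] = sacrebleu.corpus_bleu(hypotheses, [references]).score
    return scores
```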
Stats
Gated Fusion: 42.0 BLEU4 on the Multi30k test set
VGAMT: 43.3 BLEU4 on the Multi30k test set
FAIR-WMT19: 40.7 BLEU4 on the Multi30k test set; 37.7 BLEU4 on newstest2019; 40.6 BLEU4 on newstest2020
RMMT: 41.5 BLEU4 on the Multi30k test set; 33.0 BLEU4 on newstest2019; 1.3 CoMMuTE score
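Unlike the BLEU4 numbers, the CoMMuTE score is contrastive: each example pairs an ambiguous source sentence and a disambiguating image with a correct and an incorrect translation, and the model is credited when it prefers the correct one. The sketch below shows only this scoring principle; the log_prob method and the example format are hypothetical stand-ins, and the benchmark's actual data format and scoring script may differ.

```python
# Minimal sketch of contrastive CoMMuTE-style scoring, assuming a
# hypothetical `model.log_prob(source, image, translation)` that returns
# the model's log-probability of a candidate translation.
def commute_accuracy(model, examples):
    """examples: iterable of (source, image, correct, incorrect) tuples,
    where the image disambiguates which translation is correct."""
    hits = total = 0
    for source, image, correct, incorrect in examples:
        # Credit the model when visual context makes it prefer the
        # translation that matches the image.
        hits += model.log_prob(source, image, correct) > model.log_prob(source, image, incorrect)
        total += 1
    return 100.0 * hits / total
```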
Quotes
"Using this principle, we introduce an evaluation framework for MMT models that measures their use of visual information to aid in the translation task."
"Our results suggest that MMT models should be designed with a baseline high performance on text-only translation."
"The study highlights the importance of evaluating MMT models based on their use of visual information and ability to translate complex sentences."