Core Concepts
Modular machine translation architectures with bridge components do not consistently outperform non-modular, fully-shared models in generalizing to unseen translation directions and out-of-distribution data.
Abstract
The paper investigates the generalization capabilities of different modular machine translation architectures, including those with "bridge" components that aim to foster language-independent representations. The authors compare the performance of six Transformer-based models, including fully-shared, fully-modular, and semi-modular designs, on the United Nations Parallel Corpus (UNPC) and OPUS100 datasets.
The key findings are:
Encoder-shared modular architectures (E) and fully-shared non-modular architectures (F) generally outperform other modular designs, including those with bridge components (T, L).
The choice of pivot language (English or Arabic) significantly impacts the results, with Arabic-centric models performing better in zero-shot conditions but worse on seen translation directions.
Performance in zero-shot and out-of-distribution settings remains substantially lower than on seen translation directions, regardless of the architecture used.
Statistical analysis using SHAP values and an OLS model indicates that bridge-based architectures actually decrease generalization capabilities compared to other modular and non-modular designs, contrary to claims in prior work.
The authors conclude that current modular architectures, especially those using bridging layers, have limited potential for improving generalization in machine translation, as a default non-modular Transformer can often match or outperform them.
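The SHAP-plus-OLS analysis mentioned above can be illustrated with a small sketch. The idea is to regress per-run scores on indicator features for architectural choices and then read off exact SHAP values, which for a linear model reduce to the coefficient times the feature's deviation from its mean. The data below is invented for illustration and is not the paper's actual measurements; the feature names ("bridge", "modular") are assumptions standing in for whatever design indicators the authors used.

```python
import numpy as np

# Hypothetical per-run data: each row is one training run, with one-hot
# indicators for (uses a bridge component, uses a modular decoder).
# Illustrative values only -- NOT the paper's measurements.
X = np.array([
    [1, 1],
    [1, 0],
    [0, 1],
    [0, 0],
    [0, 1],
    [0, 0],
], dtype=float)
# Hypothetical generalization scores (e.g. zero-shot BLEU) per run.
y = np.array([12.1, 13.0, 15.2, 15.8, 14.9, 16.1])

# OLS fit with an explicit intercept column.
A = np.hstack([np.ones((len(X), 1)), X])
coef, *_ = np.linalg.lstsq(A, y, rcond=None)
intercept, w = coef[0], coef[1:]

# For a linear model, the exact SHAP value of feature i on sample x is
# w_i * (x_i - mean(x_i)): the coefficient scaled by how far the feature
# sits from its average over the dataset.
shap_values = w * (X - X.mean(axis=0))

# A negative coefficient (and negative SHAP values whenever the bridge
# indicator is 1) corresponds to the kind of finding reported: bridge
# layers are associated with lower generalization scores.
print("bridge coefficient:", w[0])
```

With an intercept in the model, the per-sample SHAP values sum to the prediction minus the mean prediction, which is what makes the attribution exact rather than approximate for OLS.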
Stats
The authors train and evaluate their models on the United Nations Parallel Corpus (UNPC) and OPUS100 datasets.
They consider three training setups for UNPC: using all 30 translation directions (UNPC-All), using only the 10 directions involving English (UNPC-EN), and using only the 10 directions involving Arabic (UNPC-AR).
Quotes
"For a given computational budget, we find non-modular architectures to be always comparable or preferable to all modular designs we study."
"Our study focused on modular architectures in a small-scale, well controlled experimental protocol; we leave questions such as whether these remarks carry on at a larger scale, both of model parameter counts and number of languages concerned, for future work."