Core Concepts
Vocabulary trimming, a common practice in neural machine translation, fails to consistently improve model performance and can even lead to substantial degradation across a wide range of hyperparameter settings.
Abstract
The paper presents a comprehensive study of the effects of vocabulary trimming in neural machine translation. Vocabulary trimming is a post-processing step that replaces rare subwords with their component subwords, with the goal of reducing model size and improving performance through increased robustness to rare subwords.
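To make the mechanism concrete, here is a minimal sketch of the trimming step, assuming BPE-segmented input and a simple frequency threshold. The function names, the greedy re-splitting strategy, and the example counts are illustrative assumptions, not the paper's implementation (standard trimming reverses the model's own BPE merges rather than matching greedily):

```python
from typing import Dict, List

def trim_token(token: str, vocab_counts: Dict[str, int], threshold: int) -> List[str]:
    """Replace a rare subword with smaller, sufficiently frequent pieces.

    If `token` occurs fewer than `threshold` times in the training data,
    it is re-split by greedy longest-match over pieces that do meet the
    threshold, falling back to single characters. (Actual trimming tools
    reverse the BPE merges instead; greedy matching keeps the sketch short.)
    """
    if vocab_counts.get(token, 0) >= threshold:
        return [token]
    pieces: List[str] = []
    i = 0
    while i < len(token):
        for j in range(len(token), i, -1):  # longest candidate first
            piece = token[i:j]
            if vocab_counts.get(piece, 0) >= threshold or j == i + 1:
                pieces.append(piece)
                i = j
                break
    return pieces

def trim_sentence(subwords: List[str], vocab_counts: Dict[str, int], threshold: int) -> List[str]:
    """Apply trimming to an already BPE-segmented sentence."""
    return [p for tok in subwords for p in trim_token(tok, vocab_counts, threshold)]

# Illustrative counts (BPE end-of-word markers omitted for brevity):
counts = {"low": 500, "er": 300, "lower": 2,
          "l": 900, "o": 900, "w": 900, "e": 900, "r": 900}
print(trim_sentence(["lower", "low"], counts, threshold=100))
# -> ['low', 'er', 'low']  (the rare subword "lower" is re-split)
```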
The key findings are:
- Trimming the vocabulary of the optimal baseline model generally leads to a decrease in BLEU score, with no positive effect observed.
- Trimming can help recover some performance in very low-performing baseline models, but this effect is not consistent across different configurations.
- Trimming only the source or only the target vocabulary does not have a consistently positive effect, and aggressively trimming the source can lead to a negative trend.
- Trimming so that 95% of tokens appear more than 100 times has, at best, only a slight positive effect, and only for suboptimal BPE configurations (see the sketch after this list).
- Preserving terminal subwords (which represent full words or concepts) during trimming does not lead to consistent improvements.
- Initializing a smaller vocabulary directly outperforms trimming a larger vocabulary to the same effective size.
- Trimming in a joint vocabulary setting also generally reduces model performance.
- The findings hold true for both the small IWSLT14 dataset and the larger Europarl dataset.
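As a reference point for the frequency criterion mentioned above, the sketch below shows one way to check it mechanically. It assumes that "tokens" refers to subword vocabulary items counted over the segmented training data, which is one plausible reading of the rule of thumb; the function name and the toy corpus are illustrative only:

```python
from collections import Counter
from typing import Iterable, List

def frequency_criterion_threshold(
    segmented_corpus: Iterable[List[str]],
    coverage: float = 0.95,
    min_count: int = 100,
) -> int:
    """Smallest trimming threshold t such that, after dropping subwords
    occurring fewer than t times, at least `coverage` of the remaining
    vocabulary items occur more than `min_count` times.

    Assumption: the 95%/100-times criterion is interpreted over
    vocabulary items (types), not running tokens.
    """
    counts = Counter(tok for sent in segmented_corpus for tok in sent)
    freqs = sorted(counts.values())
    n_high = sum(1 for c in freqs if c > min_count)
    # Raising the threshold only removes low-frequency items, so scan
    # candidate thresholds from low to high until coverage is reached.
    for t in [0] + freqs:
        kept = sum(1 for c in freqs if c >= t)
        if kept > 0 and n_high / kept >= coverage:
            return t
    return min_count + 1  # Degenerate case: no item exceeds min_count.

# Toy usage with a tiny segmented corpus (illustrative only):
corpus = [["the", "cat", "sat"], ["the", "dog", "sat"], ["the", "cat"]]
print(frequency_criterion_threshold(corpus, coverage=0.5, min_count=2))
# -> 3: only subwords occurring at least 3 times would be kept.
```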
Overall, the paper concludes that vocabulary trimming, a commonly recommended practice, does not consistently improve neural machine translation performance and can even lead to substantial degradation across a wide range of hyperparameter settings.
Stats
This summary does not reproduce specific numerical data or statistics for the key findings. In the paper, the results are supported by comparisons of BLEU scores across vocabulary sizes and trimming configurations.