The paper investigates the impact of near-duplicate subwords on the performance of language models (LMs). It first conducts experiments in a controlled setting where the vocabulary is synthetically duplicated, allowing the authors to quantify an upper bound on the potential gains from improved generalization across duplicates. They find that LMs trained on a fully duplicated vocabulary require roughly 17% more data to achieve the same performance as LMs trained on the original vocabulary.
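Concretely, the synthetic duplication can be thought of as giving every subword two interchangeable IDs and sampling one uniformly at each occurrence in the training data. The minimal sketch below illustrates this mapping over token IDs; the function name and the 50/50 sampling are illustrative assumptions, not the paper's actual code.

```python
import random

def duplicate_vocabulary(token_ids, vocab_size, p=0.5, seed=0):
    """Map each token occurrence to one of two synthetic copies.

    Copy A keeps the original id; copy B uses id + vocab_size,
    so the duplicated vocabulary has 2 * vocab_size entries.
    (A sketch of the paper's controlled setting, not its exact code.)
    """
    rng = random.Random(seed)
    return [tid + vocab_size if rng.random() < p else tid
            for tid in token_ids]

# Example: a toy corpus over a 5-token vocabulary.
corpus = [0, 1, 2, 1, 4, 3, 1]
print(duplicate_vocabulary(corpus, vocab_size=5))
# [0, 1, 7, 6, 4, 8, 1]: each occurrence independently lands on
# one of its two duplicate ids (copy B ids are offset by 5).
```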
The authors then examine the impact of naturally occurring near duplicates in LM vocabularies. In contrast to the synthetic setting, they find that merging these near duplicates generally hurts model performance rather than improving it. This suggests that real near duplicates are not fully equivalent: the information lost by merging them outweighs the potential gains from improved generalization.
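To make the merging operation concrete, the sketch below collapses case variants, one plausible notion of "near duplicate", onto a single canonical token ID. The equivalence key (lowercasing) and the helper name are assumptions for illustration, not the paper's exact deduplication procedure.

```python
def deduplication_map(vocab):
    """Map near-duplicate subwords (here: case variants) to one id.

    `vocab` is {subword_string: token_id}. Treating the lowercased
    string as the equivalence key is one plausible notion of
    'near duplicate'; the lowest token id in each group is kept
    as the canonical representative.
    """
    canonical = {}   # lowercased form -> canonical token id
    remap = {}       # original token id -> canonical token id
    for subword, tid in sorted(vocab.items(), key=lambda kv: kv[1]):
        key = subword.lower()
        canonical.setdefault(key, tid)
        remap[tid] = canonical[key]
    return remap

vocab = {"the": 0, "The": 1, "THE": 2, "cat": 3, "Cat": 4}
print(deduplication_map(vocab))
# {0: 0, 1: 0, 2: 0, 3: 3, 4: 3}
```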
The paper further analyzes the representations learned by the LMs, showing that frequent duplicate subwords tend to have highly aligned embeddings, which enables some degree of generalization. However, this alignment is less pronounced for infrequent subwords, and the presence of deduplicated tokens in the input context tends to hurt prediction performance.
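The reported alignment can be probed by measuring the cosine similarity between the embeddings of each duplicate pair, as in the sketch below. The function and toy data are illustrative, not the paper's evaluation code.

```python
import numpy as np

def pairwise_alignment(emb, pairs):
    """Cosine similarity between embeddings of duplicate pairs.

    emb: (2V, d) embedding matrix; pairs: list of (id_a, id_b).
    Consistently high similarity for frequent pairs would indicate
    the kind of alignment the paper reports.
    """
    sims = []
    for a, b in pairs:
        va, vb = emb[a], emb[b]
        sims.append(float(va @ vb / (np.linalg.norm(va) * np.linalg.norm(vb))))
    return sims

# Toy check: a deliberately well-aligned pair vs. a random pair.
rng = np.random.default_rng(0)
emb = rng.normal(size=(10, 8))
emb[1] = emb[0] + 0.05 * rng.normal(size=8)  # near-duplicate of id 0
print(pairwise_alignment(emb, [(0, 1), (2, 3)]))
# The first pair is constructed to be nearly parallel, so its
# similarity is close to 1; the second pair is random.
```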
Finally, the authors experiment with re-injecting information about the original subwords through a shared learned embedding. This mitigates some of the performance loss from deduplication but does not recover the full gains observed in the synthetic setting.
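One way to realize such re-injection, sketched below under the assumption that each merged token carries a small "variant" ID identifying its original form, is to add a variant embedding table, shared across the whole vocabulary, to the canonical subword embedding. The module name and parameterization are assumptions, not the paper's exact design.

```python
import torch
import torch.nn as nn

class DedupEmbedding(nn.Module):
    """Canonical subword embedding plus a shared variant embedding.

    Each input token is represented by its deduplicated (canonical)
    id together with a small variant id (e.g. 0 = lowercase form,
    1 = capitalized form). The variant table is shared across the
    whole vocabulary, so the original-subword information is
    re-injected at negligible parameter cost. A sketch of the
    paper's idea, not its exact parameterization.
    """
    def __init__(self, vocab_size, n_variants, dim):
        super().__init__()
        self.subword = nn.Embedding(vocab_size, dim)
        self.variant = nn.Embedding(n_variants, dim)

    def forward(self, canonical_ids, variant_ids):
        return self.subword(canonical_ids) + self.variant(variant_ids)

layer = DedupEmbedding(vocab_size=50_000, n_variants=2, dim=64)
tokens = torch.tensor([[3, 3, 7]])    # canonical (deduplicated) ids
variants = torch.tensor([[0, 1, 0]])  # which original duplicate
print(layer(tokens, variants).shape)  # torch.Size([1, 3, 64])
```

Because the variant table has only a handful of rows, this design adds almost no parameters while letting the model distinguish merged forms when the distinction matters.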