Core Concepts
Vocabulary trimming techniques based on language heuristics can reduce the memory usage of small language models by up to 50% and improve generation speed by up to 25%, but their effectiveness diminishes for larger models and certain languages.
Summary
The paper examines the use of vocabulary trimming (VT) techniques to improve the efficiency of large language model (LLM) inference. Two language-based heuristics are explored: Unicode-based script filtering and corpus-based selection.
The experiments cover four languages (Bulgarian, Chinese, English, and Spanish) using models of various sizes from the BLOOM and LLaMA families. The results show that:
- Unicode-based script filtering can maintain quality for Latin-based languages but harms performance for languages requiring code-mixing.
- Corpus-based selection leads to fewer alterations but is less effective in reducing the embedding size.
- The benefits of VT diminish as the model size increases, as the proportion of the embedding matrices becomes smaller in larger models.
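The two heuristics can be sketched in a few lines. This is a toy illustration, not the paper's implementation: the function names (`script_ok`, `corpus_select`), the word-level vocabulary, and the rule "keep a token if all its alphabetic characters belong to an allowed script" are assumptions made for the example.

```python
from collections import Counter
import unicodedata

def script_ok(token, allowed=("LATIN",)):
    """Unicode-based script filtering: keep a token if every alphabetic
    character it contains belongs to one of the allowed scripts
    (digits, punctuation, and marker characters always pass)."""
    return all(
        any(unicodedata.name(ch, "").startswith(s) for s in allowed)
        for ch in token if ch.isalpha()
    )

def corpus_select(vocab, corpus_tokens, top_k):
    """Corpus-based selection: keep the top_k vocabulary entries that
    occur most frequently in a sample corpus."""
    counts = Counter(t for t in corpus_tokens if t in vocab)
    return {t for t, _ in counts.most_common(top_k)}

vocab = ["hello", "niño", "你好", "##ing", "123", "."]

# Script filtering drops the CJK token but keeps Latin-script ones,
# including accented Spanish text such as "niño".
latin_vocab = [t for t in vocab if script_ok(t)]

# Corpus-based selection keeps only what a sample corpus actually uses,
# so it trims less aggressively but rarely alters model behavior.
corpus = ["hello", "hello", ".", "niño"]
kept = corpus_select(set(vocab), corpus, top_k=3)
```

The contrast between the two heuristics is visible even here: script filtering removes every out-of-script token regardless of usage, while corpus selection only retains tokens observed in the sample, which explains why it causes fewer output alterations but shrinks the embedding matrix less.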
The oracle vocabulary selection experiment reveals an upper bound of 20% time improvement for smaller BLOOM models and only 5-10% for larger 7B models. Memory usage can be reduced by nearly 50% for the smallest BLOOM model, but the relative reduction is modest for the larger models.
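The diminishing returns follow from simple arithmetic: the embedding matrix costs roughly |V| × d parameters, a near-fixed amount that shrinks as a fraction of the total as the model grows. The sketch below uses approximate public figures for BLOOM (a shared vocabulary of about 250k entries, hidden sizes 1024 and 4096); these numbers are assumptions for illustration, not values taken from the paper.

```python
# Why VT helps small models most: the embedding matrix is |V| * d parameters,
# which dominates a small model but is a minor slice of a large one.
VOCAB = 250_000  # approximate shared BLOOM vocabulary size

models = {
    "BLOOM-560M": (1024, 560e6),  # (hidden size d, total parameters)
    "BLOOM-7B":   (4096, 7.1e9),
}
for name, (d, total) in models.items():
    frac = VOCAB * d / total
    print(f"{name}: embedding matrix is roughly {frac:.0%} of all parameters")
```

Under these assumptions the embedding matrix is close to half of the smallest model but well under a fifth of the 7B model, matching the reported ~50% versus modest memory reductions.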
The authors conclude that while VT can improve efficiency, it has limitations in maintaining consistent performance across languages and model sizes. They suggest that VT can be applied orthogonally to other efficiency methods like efficient attention and quantization.
Statistics
The paper reports the following key metrics:
- Vocabulary size (|V|) for the full and trimmed models
- Decoding time (in minutes:seconds) for the full and trimmed models
- Percentage of misses (miss) between the full and trimmed model outputs
- Overlap BLEU (o-BLEU) and chrF (o-chrF) scores between the full and trimmed model outputs
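The overlap metrics treat the full model's output as the reference against which the trimmed model's output is scored. The miss rate can be sketched as follows, assuming (as an illustration, not the paper's exact definition) that a "miss" is an exact string mismatch between the two outputs:

```python
def miss_rate(full_outputs, trimmed_outputs):
    """Fraction of test inputs where the trimmed model's generation
    differs from the full model's (exact string match)."""
    pairs = list(zip(full_outputs, trimmed_outputs))
    return sum(f != t for f, t in pairs) / len(pairs)

full = ["the cat sat", "hola mundo", "42"]
trimmed = ["the cat sat", "hola mundo!", "42"]
rate = miss_rate(full, trimmed)  # one of three outputs differs
```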
Quotes
"Unicode-based script filtering maintains quality for Latin-based languages but harms languages requiring code-mixing."
"Corpus-based selection leads to fewer alterations but is less effective in reducing the embedding size."
"The benefits of VT diminish as the model size increases, as the proportion of the embedding matrices becomes smaller in larger models."