Key Concepts
Divergent Token Metrics (DTMs) provide a more nuanced and accurate evaluation of compressed Large Language Models (LLMs) compared to traditional perplexity or accuracy measures, enabling deeper insights into the impacts of individual model components during the compression process.
Summary
The paper introduces Divergent Token Metrics (DTMs), a novel approach to assessing the performance of compressed LLMs. Traditional metrics like perplexity or accuracy fail to capture the subtle nuances in text generation quality introduced by compression.
The key highlights are:
DTMs, including the First Divergent Token Metric (FDTM) and the Share of Divergent Tokens Metric (SDTM), directly measure the divergence between the outputs of the original and compressed models during the iterative generation process. This provides a more accurate reflection of the actual text generation quality.
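The two metrics can be illustrated with a minimal sketch. The function names and the exact normalization below are illustrative assumptions, not the paper's reference implementation; the core idea is simply to compare the token sequences generated by the original and the compressed model position by position.

```python
def first_divergent_token(base_tokens, compressed_tokens):
    """FDTM sketch: index of the first position where the two
    generations disagree; returns the shared length if they never do."""
    for i, (a, b) in enumerate(zip(base_tokens, compressed_tokens)):
        if a != b:
            return i
    return min(len(base_tokens), len(compressed_tokens))


def share_of_divergent_tokens(base_tokens, compressed_tokens):
    """SDTM sketch: fraction of compared positions where the
    generations disagree."""
    n = min(len(base_tokens), len(compressed_tokens))
    if n == 0:
        return 0.0
    return sum(a != b for a, b in zip(base_tokens, compressed_tokens)) / n
```

A later divergence (larger FDTM) and a smaller share of divergent tokens (smaller SDTM) both indicate that the compressed model tracks the original generation more closely.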
Using FDTM to guide model sparsification, the authors show that 25% of all attention components in the Llama-2 model family can be pruned to more than 90% sparsity while still maintaining state-of-the-art performance.
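As background for the sparsification result, the following is a generic magnitude-pruning sketch that zeroes a weight matrix to a target sparsity. It is an assumption for illustration only: the paper's contribution is ranking *which* components to prune via FDTM, not this particular pruning rule.

```python
import numpy as np

def magnitude_prune(weights, sparsity):
    """Zero out the smallest-magnitude entries so that roughly a
    `sparsity` fraction of the matrix becomes zero (illustrative only)."""
    k = int(weights.size * sparsity)
    if k == 0:
        return weights.copy()
    # k-th smallest absolute value serves as the pruning threshold
    threshold = np.partition(np.abs(weights).ravel(), k - 1)[k - 1]
    pruned = weights.copy()
    pruned[np.abs(pruned) <= threshold] = 0.0
    return pruned
```

Pruning a component "to more than 90% sparsity" then means calling such a routine with `sparsity > 0.9` on that component's weights.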
For model quantization, FDTM suggests that more than 80% of the parameters can be naively converted to int8 without special outlier management, outperforming standard techniques like LLM.int8().
The proposed metrics reveal that the attention mechanism is not utilized efficiently throughout the model, with a significant portion of attention components being highly prunable. This suggests the need for more targeted compression strategies that consider the individual importance of model components.
Overall, the Divergent Token Metrics provide a more comprehensive and accurate assessment of compressed LLMs, enabling the development of more efficient and effective compression techniques.
Statistics
25% of attention components can be pruned to more than 90% sparsity while maintaining SOTA performance.
More than 80% of model parameters can be converted to int8 without special outlier management.
Quotes
"Perplexity fails to identify minor variations in model degradation at an early stage."
"Divergent Token Metrics closely reflect the generation process and so can be a measure to foster confidence in the deployed compressed models."