Leveraging structured pruning and continued pre-training, we can produce smaller yet competitive large language models at only a fraction of the compute budget needed to train them from scratch.
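As a rough illustration of the first half of that pipeline, the sketch below prunes whole intermediate units of a feed-forward block by a simple weight-norm importance score; the shrunken model would then be recovered with continued pre-training. This is a generic structured-pruning example, not the paper's exact mask-learning procedure, and `prune_ffn` and its dimensions are hypothetical.

```python
import torch
import torch.nn as nn

def prune_ffn(ffn_in: nn.Linear, ffn_out: nn.Linear, keep_ratio: float = 0.5):
    """Structured pruning sketch: drop whole intermediate units of an FFN block.

    Each unit's importance is approximated by the L2 norms of its incoming and
    outgoing weights; continued pre-training would then recover the lost quality.
    """
    importance = ffn_in.weight.norm(dim=1) * ffn_out.weight.norm(dim=0)
    k = int(keep_ratio * importance.numel())
    keep = torch.topk(importance, k).indices.sort().values

    new_in = nn.Linear(ffn_in.in_features, k, bias=ffn_in.bias is not None)
    new_out = nn.Linear(k, ffn_out.out_features, bias=ffn_out.bias is not None)
    with torch.no_grad():
        new_in.weight.copy_(ffn_in.weight[keep])
        new_out.weight.copy_(ffn_out.weight[:, keep])
        if ffn_in.bias is not None:
            new_in.bias.copy_(ffn_in.bias[keep])
        if ffn_out.bias is not None:
            new_out.bias.copy_(ffn_out.bias)
    return new_in, new_out

# Example: shrink a 4096 -> 11008 -> 4096 FFN to half its intermediate width.
fc1, fc2 = nn.Linear(4096, 11008), nn.Linear(11008, 4096)
small_fc1, small_fc2 = prune_ffn(fc1, fc2, keep_ratio=0.5)
print(small_fc1.weight.shape, small_fc2.weight.shape)  # (5504, 4096), (4096, 5504)
```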
An accurate and efficient low-bitwidth post-training quantization method, QLLM, is proposed to address the challenge of activation outliers in quantizing large language models.
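The toy example below shows why activation outliers are the core difficulty: with per-tensor quantization, a single large-magnitude channel inflates the quantization step for every other value. The channel-splitting step is only a loose, simplified stand-in for the idea of decomposing outlier channels (with the matching weights duplicated), not QLLM's actual procedure.

```python
import numpy as np

rng = np.random.default_rng(0)

def quantize(x, bits=8):
    """Symmetric fake quantization with a single per-tensor scale."""
    scale = np.abs(x).max() / (2 ** (bits - 1) - 1)
    return np.round(x / scale) * scale

# Activations: mostly small values, plus one outlier channel (a common LLM pattern).
acts = rng.normal(0, 1, size=(128, 64))
acts[:, 0] *= 100.0                      # channel 0 carries large-magnitude outliers

err_naive = np.mean((acts - quantize(acts)) ** 2)

# Simplified stand-in for channel disassembly: split the outlier channel into
# several lower-magnitude sub-channels (the matching weight column would be
# duplicated so the layer output is unchanged), then quantize again.
split = 8
balanced = np.concatenate(
    [np.repeat(acts[:, :1] / split, split, axis=1), acts[:, 1:]], axis=1)
err_split = np.mean((balanced - quantize(balanced)) ** 2)

print(f"MSE with outlier channel: {err_naive:.4f}")
print(f"MSE after splitting it:   {err_split:.4f}")
```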
Divergent Token Metrics (DTMs) provide a more nuanced and accurate evaluation of compressed Large Language Models (LLMs) compared to traditional perplexity or accuracy measures, enabling deeper insights into the impacts of individual model components during the compression process.
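A minimal sketch of the underlying idea, assuming access to both models' logits over the same sequence: count where the compressed model's greedy predictions diverge from the original's, which is exactly the token-level signal that an averaged perplexity score hides. The function below is an illustration of the concept, not the paper's exact metric definitions.

```python
import numpy as np

def divergent_token_stats(logits_ref: np.ndarray, logits_comp: np.ndarray):
    """Compare greedy next-token predictions of an original and a compressed model.

    Both logit arrays have shape (sequence_length, vocab_size).  Returns the
    position of the first token where the models disagree (or None) and the
    fraction of divergent positions.
    """
    ref_ids = logits_ref.argmax(axis=-1)
    comp_ids = logits_comp.argmax(axis=-1)
    diverged = ref_ids != comp_ids
    first = int(diverged.argmax()) if diverged.any() else None
    return first, float(diverged.mean())

# Toy example with random logits standing in for the two models' outputs.
rng = np.random.default_rng(0)
ref = rng.normal(size=(32, 1000))
comp = ref + rng.normal(scale=0.5, size=ref.shape)   # "compressed" model drifts slightly
first_divergence, divergent_share = divergent_token_stats(ref, comp)
print(first_divergence, f"{divergent_share:.2%}")
```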
Neither forward Kullback-Leibler (FKL) divergence nor reverse Kullback-Leibler (RKL) divergence exhibits the expected mean-seeking or mode-seeking behaviors in knowledge distillation for large language models. Instead, both FKL and RKL converge to the same optimization objective after a sufficient number of epochs. However, due to practical constraints, large language models are rarely trained for that many epochs. The authors propose an Adaptive Kullback-Leibler (AKL) divergence method that adaptively allocates weights to combine FKL and RKL, focusing on aligning the head and tail parts of the distributions.
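For reference, a small sketch of the quantities involved: FKL and RKL over a token distribution, plus an adaptive combination whose weight shifts toward FKL when the student is further off on the teacher's high-probability "head" and toward RKL when the gap sits in the tail. The weighting formula here is an illustrative approximation, not necessarily the paper's exact definition.

```python
import numpy as np

def fkl(p, q, eps=1e-12):
    """Forward KL D(p || q): teacher distribution p, student distribution q."""
    return float(np.sum(p * np.log((p + eps) / (q + eps))))

def rkl(p, q, eps=1e-12):
    """Reverse KL D(q || p)."""
    return float(np.sum(q * np.log((q + eps) / (p + eps))))

def akl(p, q, head_mass=0.5):
    """Adaptive KL sketch: weight FKL/RKL by where the student is further off.

    The 'head' is the teacher's high-probability region, the 'tail' the rest;
    a larger head gap up-weights FKL, a larger tail gap up-weights RKL.
    """
    order = np.argsort(p)[::-1]
    head = order[np.cumsum(p[order]) <= head_mass]
    tail = np.setdiff1d(order, head)
    gap_head = np.abs(p[head] - q[head]).sum()
    gap_tail = np.abs(p[tail] - q[tail]).sum()
    alpha = gap_head / (gap_head + gap_tail + 1e-12)
    return alpha * fkl(p, q) + (1 - alpha) * rkl(p, q)

# Toy teacher/student distributions over a 10-token vocabulary.
rng = np.random.default_rng(0)
p = rng.dirichlet(np.ones(10))
q = rng.dirichlet(np.ones(10))
print(fkl(p, q), rkl(p, q), akl(p, q))
```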
A parameter-efficient, distillation-based approach is proposed for training a palette of smaller language models from a large pre-trained teacher model, enabling efficient deployment on edge devices.
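The skeleton below sketches only the generic "palette" distillation idea: one frozen teacher forward pass supervising several student sizes with a KL loss. The model definitions, sizes, and optimizer settings are placeholders, and the parameter-efficient aspects of the actual approach are not reproduced here.

```python
import torch
import torch.nn.functional as F

# Toy palette: one teacher supervises several student widths at once.
vocab, hidden_sizes = 1000, [64, 128, 256]
teacher = torch.nn.Sequential(torch.nn.Embedding(vocab, 512), torch.nn.Linear(512, vocab))
students = [torch.nn.Sequential(torch.nn.Embedding(vocab, h), torch.nn.Linear(h, vocab))
            for h in hidden_sizes]
opts = [torch.optim.AdamW(s.parameters(), lr=1e-4) for s in students]

tokens = torch.randint(0, vocab, (8, 32))              # a toy batch of token ids
with torch.no_grad():
    teacher_logits = teacher(tokens)                    # (batch, seq, vocab)

for student, opt in zip(students, opts):
    student_logits = student(tokens)
    # Forward KL between the teacher's and student's token distributions.
    loss = F.kl_div(F.log_softmax(student_logits, dim=-1),
                    F.softmax(teacher_logits, dim=-1),
                    reduction="batchmean")
    opt.zero_grad()
    loss.backward()
    opt.step()
    print(f"hidden={student[0].embedding_dim}, distill loss={loss.item():.3f}")
```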
Certain consecutive layers in large language models have minimal impact on hidden states, allowing for effective layer pruning without significant performance degradation.
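A minimal sketch of how such redundant blocks can be located, assuming per-layer hidden states are available: score each run of n consecutive layers by the cosine similarity between its input and output representations and pick the most similar one. This is a simplified version of the redundancy criteria used in recent layer-pruning work, not any single paper's exact metric.

```python
import torch

def most_redundant_block(hidden_states: list[torch.Tensor], n: int = 4):
    """Find the run of n consecutive layers whose removal would change hidden
    states least, scored by cosine similarity between the block's input and
    output representations.

    hidden_states[i] has shape (tokens, hidden_dim) and is the output of layer i,
    with index 0 being the embedding output.
    """
    best_start, best_sim = None, -1.0
    for start in range(len(hidden_states) - n):
        h_in, h_out = hidden_states[start], hidden_states[start + n]
        sim = torch.cosine_similarity(h_in, h_out, dim=-1).mean().item()
        if sim > best_sim:
            best_start, best_sim = start, sim
    return best_start, best_sim

# Toy example: 33 hidden-state snapshots (embeddings + 32 layers) for 256 tokens.
hs = [torch.randn(256, 1024) for _ in range(33)]
start, sim = most_redundant_block(hs, n=4)
print(f"layers {start + 1}..{start + 4} look most prunable (cos sim {sim:.3f})")
```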