Core Concepts
Certain consecutive layers in large language models have minimal impact on hidden states, allowing for effective layer pruning without significant performance degradation.
Abstract
The content discusses LLM-Streamline, a method for compressing large language models (LLMs) by pruning unimportant layers. The key observations are:
Different layers in LLMs have varying degrees of impact on the hidden states, with some consecutive layers exhibiting minimal perturbation.
The authors propose using the cosine similarity between a layer's input and output hidden states as an importance metric: the higher the similarity, the less the layer changes the representation, and the less important it is.
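For concreteness, below is a minimal sketch of how such a per-layer score could be computed, assuming a Hugging Face-style causal LM that exposes intermediate hidden states via output_hidden_states; the function name layer_similarity_scores is illustrative, not the authors' code.

```python
import torch
import torch.nn.functional as F

def layer_similarity_scores(model, input_ids):
    """For each transformer layer, measure the average cosine similarity
    between the hidden states entering and leaving it. A value near 1 means
    the layer barely perturbs the representation, i.e. low importance."""
    with torch.no_grad():
        outputs = model(input_ids, output_hidden_states=True)
    hidden = outputs.hidden_states          # (embeddings, layer_1, ..., layer_L)

    scores = []
    for i in range(len(hidden) - 1):
        h_in, h_out = hidden[i], hidden[i + 1]
        # Per-token cosine similarity along the hidden dimension,
        # averaged over the batch and sequence positions.
        sim = F.cosine_similarity(h_in, h_out, dim=-1)
        scores.append(sim.mean().item())
    return scores                           # scores[i] covers layer i + 1
```

Layers whose scores are closest to 1 would then be the natural candidates for pruning.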
The proposed LLM-Streamline framework consists of two steps:
Layer pruning: Removing the set of consecutive layers with the lowest importance, with the number of removed layers determined by the target sparsity.
Layer replacement: Training a lightweight network, such as a multi-layer perceptron (MLP), to substitute for the pruned layers and mitigate the performance loss (sketched below).
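The two steps could be sketched as follows, purely as an illustration of the idea: select_prune_window, ReplacementMLP, and train_replacement are hypothetical names, the block-level similarity criterion is one plausible reading of the importance metric above, and the paper's actual replacement network and training recipe may differ.

```python
import torch
import torch.nn as nn
import torch.nn.functional as F

def select_prune_window(hidden_states, num_prune):
    """Pick the start index of the contiguous block of `num_prune` layers whose
    input and output hidden states are most similar (least important block).
    `hidden_states` is the tuple returned with output_hidden_states=True."""
    best_start, best_sim = 0, -1.0
    for start in range(len(hidden_states) - num_prune):
        h_in = hidden_states[start]                  # enters the block
        h_out = hidden_states[start + num_prune]     # leaves the block
        sim = F.cosine_similarity(h_in, h_out, dim=-1).mean().item()
        if sim > best_sim:
            best_start, best_sim = start, sim
    return best_start

class ReplacementMLP(nn.Module):
    """Lightweight substitute for the pruned block: a two-layer MLP with a
    residual connection over the hidden states."""
    def __init__(self, hidden_size, intermediate_size=1024):
        super().__init__()
        self.up = nn.Linear(hidden_size, intermediate_size)
        self.act = nn.SiLU()
        self.down = nn.Linear(intermediate_size, hidden_size)

    def forward(self, x):
        return x + self.down(self.act(self.up(x)))

def train_replacement(mlp, pairs, epochs=3, lr=1e-4):
    """Fit the MLP to reproduce the pruned block's behaviour: each element of
    `pairs` is a (block_input, block_output) pair of hidden-state tensors
    recorded from the original model on some training text."""
    opt = torch.optim.AdamW(mlp.parameters(), lr=lr)
    loss_fn = nn.MSELoss()
    for _ in range(epochs):
        for x, y in pairs:
            opt.zero_grad()
            loss_fn(mlp(x), y).backward()
            opt.step()
    return mlp
```

In practice the (block input, block output) pairs would be collected by running the original model over a small corpus, and the trained MLP would then be spliced into the network in place of the removed layers.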
Experiments on various LLM architectures (OPT, Llama2) show that LLM-Streamline can maintain 92% of the original model's performance on classification tasks and 68% on generation tasks, while reducing the model size by 25%.
The authors also explore the impact of different lightweight models, the number of lightweight models, and the amount of training data on the performance of the pruned models.
Stats
The main reported figures are those noted above: with roughly 25% of the model pruned, LLM-Streamline retains 92% of the original model's performance on classification tasks and 68% on generation tasks.