The authors make the key observation that the multi-head self-attention (MHA) sub-layer of the Transformer exhibits a more pronounced low-rank structure than the feed-forward network (FFN) sub-layer. Based on this insight, they propose a mixed compression method called LoRAP, which combines low-rank matrix approximation and structured pruning, applying low-rank approximation to the MHA sub-layer and structured pruning to the FFN sub-layer.
For the MHA sub-layer, the authors devise an Activation Weighted SVD (AWSVD) method, which estimates weight importance from the ℓ2 norm of the corresponding input activations. They also observe that the weight matrices within the MHA sub-layer exhibit different degrees of low-rankness, and propose a novel parameter allocation scheme that assigns ranks accordingly.
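The description suggests a procedure along the following lines: scale each input column of a projection matrix by the ℓ2 norm of its activations, factorize with a truncated SVD, then undo the scaling. The sketch below is a minimal PyTorch illustration under those assumptions; the function name, calibration data, and rank choice are placeholders, not the authors' released code.

```python
import torch

def awsvd_compress(W, X, rank):
    """Activation-weighted SVD sketch.
    W: weight matrix of shape (out, in); X: calibration activations (n_tokens, in).
    Returns factors A (out, rank) and B (rank, in) with W ≈ A @ B."""
    # Per-input-channel importance: L2 norm of the activations feeding that channel.
    s = X.norm(p=2, dim=0).clamp_min(1e-8)          # (in,), avoid division by zero
    # Scale the columns of W by the activation norms, factorize, then undo the scaling.
    U, S, Vh = torch.linalg.svd(W * s, full_matrices=False)
    A = U[:, :rank] * S[:rank]                      # (out, rank)
    B = Vh[:rank, :] / s                            # (rank, in)
    return A, B

# Usage: replace an attention projection with two smaller linear maps.
W = torch.randn(512, 512)
X = torch.randn(1024, 512)        # activations entering this layer on calibration data
A, B = awsvd_compress(W, X, rank=128)
print((W - A @ B).norm() / W.norm())   # relative reconstruction error
```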
For the FFN sub-layer, the authors introduce a gradient-free structured channel pruning method that removes channels according to their group importance. Interestingly, they find that the least important 1% of parameters play a vital role in the model's performance, and therefore suggest retaining these parameters under a fixed parameter budget.
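For a gated FFN (as in LLaMA-style models), one way such group-wise, gradient-free pruning could look is sketched below: each intermediate channel gets a score aggregated over the rows and columns that belong to it, the lowest-scoring channels are removed, but a small "least important" tail is protected, reflecting the paper's observation. The scoring formula, function name, and fractions here are illustrative assumptions, not the paper's exact criterion.

```python
import torch

def prune_ffn_channels(W_gate, W_up, W_down, X, keep_ratio=0.5, protect_frac=0.01):
    """Structured channel pruning sketch for a gated FFN.
    W_gate, W_up: (hidden, dim); W_down: (dim, hidden); X: calibration inputs (n_tokens, dim)."""
    s = X.norm(p=2, dim=0)                                   # (dim,) input activation scale
    # Group importance of each intermediate channel: its rows in gate/up plus its column in down.
    score = (W_gate * s).norm(dim=1) + (W_up * s).norm(dim=1) + W_down.norm(dim=0)
    order = torch.argsort(score, descending=True)
    hidden = score.numel()
    n_keep = int(keep_ratio * hidden)
    n_protect = max(1, int(protect_frac * hidden))
    keep = torch.cat([order[: n_keep - n_protect],           # most important channels
                      order[-n_protect:]])                   # protected least-important tail
    return W_gate[keep], W_up[keep], W_down[:, keep]
```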
The authors evaluate the compressed model via zero-shot perplexity on the WikiText2 and PTB datasets, as well as zero-shot classification on 7 common-sense reasoning datasets. At multiple compression ratios, LoRAP outperforms existing structured pruning and low-rank approximation methods.
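For reference, zero-shot perplexity on WikiText2 is typically measured by running the compressed model over the test split in fixed-length chunks and exponentiating the mean token-level loss. Below is a generic sketch using Hugging Face `transformers` and `datasets`; the checkpoint path and chunk length are placeholders, and this is not the authors' evaluation script.

```python
import torch
from datasets import load_dataset
from transformers import AutoModelForCausalLM, AutoTokenizer

model = AutoModelForCausalLM.from_pretrained("path/to/compressed-model").eval()
tok = AutoTokenizer.from_pretrained("path/to/compressed-model")

text = "\n\n".join(load_dataset("wikitext", "wikitext-2-raw-v1", split="test")["text"])
ids = tok(text, return_tensors="pt").input_ids

seq_len, nlls = 2048, []
with torch.no_grad():
    for i in range(0, ids.size(1) - seq_len, seq_len):
        chunk = ids[:, i : i + seq_len]
        nlls.append(model(chunk, labels=chunk).loss)   # mean NLL over the chunk
print("perplexity:", torch.exp(torch.stack(nlls).mean()).item())
```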
Source: Guangyan Li et al., arXiv, April 16, 2024. https://arxiv.org/pdf/2404.09695.pdf