Core Concepts
Compression techniques such as pruning, knowledge distillation, and quantization can make small-data language models significantly more efficient without substantially compromising their performance.
Abstract
This paper investigates the effectiveness of pruning, knowledge distillation, and quantization on the small-data, low-resource language model AfriBERTa. The key findings are:
Distillation:
- Distillation achieves up to 31% compression while maintaining competitive results: the least-performing distilled model drops only 7% in performance, and at 22% compression the decline relative to the best-performing AfriBERTa model is just 1.9%.
- The choice of teacher model (base vs. large) significantly influences the performance of the distilled student models (see the loss sketch below).
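As a point of reference for how such a student model is trained, the sketch below shows a standard soft-target distillation loss in PyTorch: the student matches the teacher's softened output distribution while still learning from the gold NER labels. The temperature, weighting factor, and token-classification setup are illustrative assumptions, not the paper's exact configuration.

```python
import torch
import torch.nn.functional as F

def distillation_loss(student_logits, teacher_logits, labels,
                      temperature=2.0, alpha=0.5):
    """Soft-target knowledge distillation loss: a weighted sum of the
    KL divergence between softened teacher/student distributions and
    the usual cross-entropy on the gold labels. The temperature and
    alpha values here are illustrative, not the paper's settings."""
    # Soften both output distributions with the temperature.
    soft_teacher = F.softmax(teacher_logits / temperature, dim=-1)
    soft_student = F.log_softmax(student_logits / temperature, dim=-1)
    # Scale the KL term by T^2 so its gradients stay comparable to the CE term.
    kd = F.kl_div(soft_student, soft_teacher,
                  reduction="batchmean") * temperature ** 2
    # Standard token-classification cross-entropy on the hard labels.
    ce = F.cross_entropy(student_logits.view(-1, student_logits.size(-1)),
                         labels.view(-1), ignore_index=-100)
    return alpha * kd + (1.0 - alpha) * ce
```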
Pruning:
- Pruning before fine-tuning yields performance consistent with the dense model up to 60% sparsity, while pruning after fine-tuning maintains performance up to 50% sparsity.
- Certain languages, such as Swahili, maintain moderate performance even at 95% sparsity, suggesting the model is robust to pruning; languages with more complex linguistic structures, such as Yoruba, degrade more sharply.
- Pruning can improve out-of-domain generalization for some languages, while the benefits are limited for others (see the pruning sketch below).
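The sparsity results above are consistent with unstructured magnitude pruning, which zeroes the smallest-magnitude weights. The sketch below shows one way to apply it with PyTorch's pruning utilities; the checkpoint name, label count, and 60% sparsity level are illustrative assumptions, and the paper's exact procedure (pruning before vs. after fine-tuning) may differ.

```python
import torch.nn as nn
import torch.nn.utils.prune as prune
from transformers import AutoModelForTokenClassification

# Assumed AfriBERTa checkpoint and label count; substitute the actual NER setup.
model = AutoModelForTokenClassification.from_pretrained(
    "castorini/afriberta_base", num_labels=9)

def magnitude_prune(model, sparsity=0.6):
    """Globally prune the smallest `sparsity` fraction of Linear-layer
    weights by L1 magnitude (unstructured pruning)."""
    params = [(m, "weight") for m in model.modules() if isinstance(m, nn.Linear)]
    prune.global_unstructured(
        params, pruning_method=prune.L1Unstructured, amount=sparsity)
    # Make the pruning masks permanent so the zeros persist without hooks.
    for module, name in params:
        prune.remove(module, name)
    return model

pruned_model = magnitude_prune(model, sparsity=0.6)  # ~60% of Linear weights zeroed
```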
Quantization:
- LLM.int8() quantization outperforms dynamic quantization, with an average decrease in F1-score of just 4.7% compared to the original model.
- Quantization significantly reduces model size (by 64.08%) and inference time (by 52.3%) with minimal impact on performance (see the quantization sketch below).
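For context, the sketch below shows post-training dynamic quantization, the weaker of the two schemes compared above, applied to a token-classification model in PyTorch; the checkpoint name is an assumption. LLM.int8() is applied differently, via 8-bit loading with bitsandbytes, as noted in the trailing comment.

```python
import torch
from transformers import AutoModelForTokenClassification

# Assumed checkpoint name; substitute the fine-tuned AfriBERTa NER model.
model = AutoModelForTokenClassification.from_pretrained("castorini/afriberta_base")

# Dynamic quantization: Linear-layer weights are stored as int8 and
# dequantized on the fly at inference time; activations stay in float32.
quantized_model = torch.quantization.quantize_dynamic(
    model, {torch.nn.Linear}, dtype=torch.qint8
)

# LLM.int8() (the better-performing scheme in the paper) is instead applied
# at load time via bitsandbytes, for example:
#   from transformers import BitsAndBytesConfig
#   model_8bit = AutoModelForTokenClassification.from_pretrained(
#       "castorini/afriberta_base",
#       quantization_config=BitsAndBytesConfig(load_in_8bit=True),
#   )
# (requires a CUDA device and the bitsandbytes package)
```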
The study demonstrates that compression techniques can effectively optimize small-data pretrained models for deployment on resource-constrained devices while maintaining their performance and generalization capabilities.
Statistics
The AfriBERTa model has 126M parameters in the large variant and 111M parameters in the base variant.
The MasakhaNER dataset covers 10 African languages with a total of 21,000 sentences and over 15,000 named entities.
Quotes
"Compression techniques have been crucial in advancing machine learning by enabling efficient training and deployment of large-scale language models. However, these techniques have received limited attention in the context of low-resource language models."
"Our experimental results demonstrate that pruning achieves ≈60% reduction in model size with a minimal performance drop. Furthermore, generalization tests reveal varied outcomes, with some languages surpassing dense models even with extreme pruning."
"Distillation achieves compression rates between 22% and 33% with comparable performances. Additionally, quantization reduces the model size by 64.08%, inference time by 52.3%, and even outperforms the baseline model in the F1 score for certain languages."