Core Concepts
Weight-Inherited Distillation (WID) offers a novel approach to compressing BERT models without the need for additional alignment losses, showcasing superior performance in task-agnostic settings.
Summary
The paper presents Weight-Inherited Distillation (WID), a method for compressing BERT models without requiring extra alignment losses. Instead of matching teacher outputs, WID trains a compact student by inheriting weights directly from the teacher model. The paper outlines the WID procedure, including its structural re-parameterization and compactor compression strategies. Experimental results on the GLUE and SQuAD benchmarks show that WID outperforms traditional KD-based baselines. Further analysis shows that WID also learns high-level semantic knowledge, such as attention patterns, from the teacher.
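To make the mechanism concrete, below is a minimal, hypothetical PyTorch sketch of the compactor idea applied to a single linear layer. It is not the authors' released code; the class name CompactedLinear, the merge method, and the keep_rows selection are illustrative assumptions. The sketch shows the two ingredients the summary mentions: a compactor matrix trained alongside inherited teacher weights (compactor compression), and a merge step that folds the pruned compactor back into those weights (structural re-parameterization), yielding a smaller layer whose parameters come from the teacher rather than being re-learned.

```python
import torch
import torch.nn as nn
import torch.nn.functional as F


class CompactedLinear(nn.Module):
    """Illustrative sketch (not the paper's implementation).

    Wraps one teacher linear layer with a trainable "compactor" matrix.
    The compactor starts as the identity, so the wrapped layer is initially
    equivalent to the teacher layer. After training, low-norm compactor rows
    are dropped and the rest is folded back into the inherited weight
    (structural re-parameterization), producing a smaller student layer.
    """

    def __init__(self, teacher_linear: nn.Linear):
        super().__init__()
        out_dim, in_dim = teacher_linear.weight.shape
        # Inherit the teacher's parameters directly.
        self.weight = nn.Parameter(teacher_linear.weight.detach().clone())
        self.bias = nn.Parameter(teacher_linear.bias.detach().clone())
        # Row compactor, initialized to the identity matrix.
        self.compactor = nn.Parameter(torch.eye(out_dim))

    def forward(self, x: torch.Tensor) -> torch.Tensor:
        # y = C (W x + b): the compactor re-mixes the output dimension.
        y = F.linear(x, self.weight, self.bias)
        return F.linear(y, self.compactor)

    def merge(self, keep_rows: torch.Tensor) -> nn.Linear:
        """Fold the pruned compactor into the inherited teacher weight."""
        c = self.compactor[keep_rows]          # (out', out)
        w_student = c @ self.weight            # (out', in)
        b_student = c @ self.bias              # (out',)
        student = nn.Linear(self.weight.shape[1], keep_rows.numel())
        student.weight.data.copy_(w_student)
        student.bias.data.copy_(b_student)
        return student


# Hypothetical usage: compress one feed-forward projection of a BERT layer.
teacher_ffn = nn.Linear(768, 3072)
layer = CompactedLinear(teacher_ffn)
# ... train `layer` inside the full model on the usual objective ...
row_norms = layer.compactor.norm(dim=1)
keep_rows = row_norms.topk(1536).indices   # keep the strongest half of the rows
student_ffn = layer.merge(keep_rows)       # 768 -> 1536 layer with inherited weights
```

Because the compactor is initialized to the identity, no alignment loss between teacher and student representations is needed in this sketch; the compression signal comes only from how the compactor rows are selected and merged.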
Outline:
- Abstract
  - Introduces Knowledge Distillation (KD) and proposes Weight-Inherited Distillation (WID).
- Introduction
  - Discusses Transformer-based Pre-trained Language Models (PLMs) and the storage and computation challenges they pose.
- Approach
  - Describes how WID transfers knowledge directly by inheriting weights, without alignment losses (see the sketch above).
- Experiments
  - Details experiments on downstream NLP tasks with student models of different sizes.
- Results
  - Compares WID with other distillation methods in terms of performance and parameter efficiency.
- Analysis and Discussion
  - Explores pruning vs. WID, multi-head attention (MHA) strategies, the impact of different teacher models, and visualization of attention distributions.
- Related Work
  - Reviews prior work on BERT compression and knowledge distillation.
Statistics
"Experimental results on the GLUE and SQuAD benchmarks show that WID outperforms previous state-of-the-art KD-based baselines."
"WID retains 98.9% and 90.9% performance of BERTbase using only 49.2% and 10.2% parameters, respectively."
Quotes
"In this paper, we propose a novel Weight-Inherited Distillation (WID), which directly transfers knowledge from the teacher."
"Our contributions can be summarized as follows: We propose Weight-Inherited Distillation (WID), revealing a new pathway to KD by directly inheriting the weights via structural re-parameterization."