The paper explores using neural architecture search (NAS) for structural pruning of pre-trained language models (PLMs) such as BERT and RoBERTa. The goal is to find sub-networks of the pre-trained model that optimally trade off efficiency (e.g., model size or latency) against generalization performance.
The key insights are:
NAS offers a distinct advantage over other pruning strategies by enabling a multi-objective approach that identifies the Pareto-optimal set of sub-networks. This automates the compression process and lets practitioners select the sub-network that best meets their requirements, instead of re-running the pruning process with different thresholds.
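To make the multi-objective selection concrete, here is a minimal sketch of how a Pareto-optimal set could be extracted from already-evaluated sub-networks. The `pareto_front` helper and the (validation error, parameter count) numbers are illustrative assumptions, not the authors' code.

```python
from typing import List, Tuple

def pareto_front(candidates: List[Tuple[float, int]]) -> List[Tuple[float, int]]:
    """Return the non-dominated (validation_error, num_parameters) pairs.

    A candidate is dominated if another candidate is no worse in both
    objectives and strictly better in at least one of them.
    """
    front = []
    for i, (err_i, size_i) in enumerate(candidates):
        dominated = any(
            (err_j <= err_i and size_j <= size_i) and (err_j < err_i or size_j < size_i)
            for j, (err_j, size_j) in enumerate(candidates) if j != i
        )
        if not dominated:
            front.append((err_i, size_i))
    return sorted(front, key=lambda c: c[1])

# Illustrative numbers only: each tuple is (validation error, parameter count).
subnetworks = [(0.12, 110_000_000), (0.13, 66_000_000), (0.15, 40_000_000),
               (0.14, 90_000_000), (0.20, 40_000_000)]
print(pareto_front(subnetworks))
# -> [(0.15, 40000000), (0.13, 66000000), (0.12, 110000000)]
```

Given such a front, the smallest model that still satisfies an accuracy constraint can be picked directly, with no additional pruning runs.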
The authors propose four search spaces for pruning transformer-based architectures, each exhibiting a different degree of pruning complexity. They show that simpler search spaces such as SMALL and LAYER can often outperform more expressive but harder-to-explore spaces such as LARGE.
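The sketch below shows, under assumptions based only on the summary's description, how such spaces might be parameterized: SMALL shares one head count and one FFN width across all layers, LAYER varies only the number of kept layers, and LARGE chooses both per layer. The BERT-base-like dimension ranges are illustrative, not the paper's exact definitions.

```python
import random

# BERT-base-like reference dimensions (assumed): 12 layers, 12 heads, 3072 FFN units.
NUM_LAYERS, NUM_HEADS, FFN_UNITS = 12, 12, 3072

def sample_small():
    """SMALL space: one head count and one FFN width shared by every layer."""
    return {"num_heads": random.randint(1, NUM_HEADS),
            "ffn_units": random.randint(1, FFN_UNITS)}

def sample_layer():
    """LAYER space: only the number of retained transformer layers varies."""
    return {"num_layers": random.randint(1, NUM_LAYERS)}

def sample_large():
    """LARGE space: head count and FFN width chosen independently per layer,
    which is far more expressive but much harder to explore."""
    return {"num_heads": [random.randint(1, NUM_HEADS) for _ in range(NUM_LAYERS)],
            "ffn_units": [random.randint(1, FFN_UNITS) for _ in range(NUM_LAYERS)]}

print(sample_small(), sample_layer())
```

The size of the LARGE space grows exponentially with depth, which is one plausible reason the smaller spaces can be easier to search effectively.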
The authors evaluate weight-sharing NAS approaches, which train a single super-network and then search for sub-networks within it. This substantially reduces the computational cost compared to standard NAS, where each candidate sub-network is fine-tuned independently.
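The sketch below illustrates that two-phase workflow under simplifying assumptions: the training step, the sampling routine, and the evaluation scores are placeholders, not the paper's implementation.

```python
import random

def sample_subnetwork():
    """Stand-in for drawing a sub-network configuration from a search space."""
    return {"num_layers": random.randint(1, 12), "num_heads": random.randint(1, 12)}

def supernet_training_step(supernet_weights, batch):
    """Placeholder for one weight-sharing fine-tuning step: a random
    sub-network is sampled and only its weights would receive updates."""
    active_config = sample_subnetwork()
    # ... forward/backward pass restricted to `active_config` would go here ...
    return supernet_weights

def evaluate_with_shared_weights(config):
    """Placeholder: score a sub-network that simply inherits the super-network
    weights (no extra fine-tuning). The returned numbers are synthetic."""
    size = config["num_layers"] * config["num_heads"] * 1_000_000
    error = 0.10 + 2.0 / (config["num_layers"] * config["num_heads"])
    return error, size

# Phase 1: fine-tune the super-network once.
weights = {}
for batch in range(1000):          # stand-in for iterating over the training data
    weights = supernet_training_step(weights, batch)

# Phase 2: the multi-objective search is cheap because each candidate is only
# evaluated with inherited weights, never re-trained.
candidates = [sample_subnetwork() for _ in range(200)]
scored = [(evaluate_with_shared_weights(c), c) for c in candidates]
```

The scored candidates can then be filtered to their Pareto-optimal subset, as in the earlier sketch.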
Empirically, the NAS-based pruning methods achieve performance competitive with or better than other structural pruning approaches such as head pruning and layer dropping, especially on larger datasets.
Overall, the paper demonstrates the effectiveness of using NAS for structural pruning of pre-trained language models to balance model efficiency and performance.
Key insights distilled from:
by Aaron Klein, ... at arxiv.org, 05-06-2024
https://arxiv.org/pdf/2405.02267.pdf