Core Concepts
Language models like BERT and RoBERTa develop internal subnetworks that correspond to theoretical linguistic categories, demonstrating a degree of learned grammatical understanding that can be analyzed using Shapley Head Values and pruning techniques.
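A minimal sketch of how per-head Shapley values might be estimated via Monte Carlo permutation sampling is given below. The `evaluate_accuracy` callback and the `(layer, head)` identifiers are hypothetical placeholders for however head masking and BLiMP scoring are wired up in practice; this is not the paper's exact implementation.

```python
import random

def estimate_shapley_head_values(heads, evaluate_accuracy, n_permutations=200):
    """Monte Carlo estimate of each attention head's Shapley value.

    heads: list of (layer, head) identifiers.
    evaluate_accuracy: hypothetical callable that takes the set of *active*
        heads (all others masked) and returns accuracy, e.g. on one BLiMP
        paradigm.
    """
    values = {h: 0.0 for h in heads}
    for _ in range(n_permutations):
        order = random.sample(heads, len(heads))  # random permutation of heads
        active = set()
        prev_acc = evaluate_accuracy(active)      # accuracy with every head masked
        for h in order:
            active.add(h)
            acc = evaluate_accuracy(active)
            values[h] += acc - prev_acc           # marginal contribution of h
            prev_acc = acc
    # Average marginal contributions over the sampled permutations.
    return {h: v / n_permutations for h, v in values.items()}
```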
Statistics
BERT-base has 110 million parameters and was trained on roughly 16GB of text data.
RoBERTa-base has 125 million parameters and was trained on roughly 160GB of text data.
The BLiMP dataset consists of 67 minimal-pair paradigms, each containing 1,000 sentence pairs, grouped into 12 linguistic phenomena (scoring is sketched below).
Pruning the top 10 attention heads (about 7% of BERT's 144 heads) had a significant impact on accuracy across multiple paradigms (a head-pruning sketch follows below).
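BLiMP paradigms like those counted above are typically scored by checking whether the model prefers the grammatical sentence in each minimal pair. The sketch below assumes a HuggingFace masked language model and uses pseudo-log-likelihood scoring (masking one token at a time); the example sentences are illustrative, not drawn from BLiMP itself.

```python
import torch
from transformers import AutoTokenizer, AutoModelForMaskedLM

tokenizer = AutoTokenizer.from_pretrained("bert-base-uncased")
model = AutoModelForMaskedLM.from_pretrained("bert-base-uncased").eval()

def pseudo_log_likelihood(sentence: str) -> float:
    """Sum of log-probabilities of each token when it is masked in turn."""
    ids = tokenizer(sentence, return_tensors="pt")["input_ids"][0]
    total = 0.0
    with torch.no_grad():
        for i in range(1, len(ids) - 1):          # skip [CLS] and [SEP]
            masked = ids.clone()
            masked[i] = tokenizer.mask_token_id
            logits = model(masked.unsqueeze(0)).logits[0, i]
            log_probs = torch.log_softmax(logits, dim=-1)
            total += log_probs[ids[i]].item()
    return total

# A minimal pair counts as correct if the grammatical sentence scores higher.
good = "The cats sleep on the couch."
bad = "The cats sleeps on the couch."
print(pseudo_log_likelihood(good) > pseudo_log_likelihood(bad))
```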
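To probe the effect of removing heads, the transformers `prune_heads` API can be used; it takes a mapping from layer index to the head indices to drop. The specific (layer, head) pairs below are hypothetical placeholders for the ten highest-ranked heads; in the analysis described above, that ranking would come from the Shapley Head Values.

```python
from transformers import AutoModelForMaskedLM

model = AutoModelForMaskedLM.from_pretrained("bert-base-uncased")

# Hypothetical ranking: the 10 heads with the highest Shapley Head Values,
# given as (layer, head) pairs. BERT-base has 12 layers x 12 heads = 144,
# so 10 heads is roughly 7% of the total.
top_heads = [(0, 3), (1, 7), (2, 0), (3, 11), (4, 5),
             (5, 9), (6, 2), (7, 8), (9, 1), (11, 6)]

# prune_heads expects {layer_index: [head indices to remove]}.
to_prune = {}
for layer, head in top_heads:
    to_prune.setdefault(layer, []).append(head)
model.prune_heads(to_prune)

# The pruned model can then be re-scored on each BLiMP paradigm to measure
# how accuracy changes once these heads are removed.
```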