Exploring the Performance Bottleneck of Small Language Models: The Role of the Softmax Bottleneck
Small language models can suffer from performance saturation when their hidden dimension is small relative to the high rank of the target contextual probability distribution. A linear prediction head on top of such a hidden state cannot represent that distribution, a limitation known as the softmax bottleneck.
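
The rank argument can be illustrated with a small NumPy sketch. It is not taken from the source: the matrix sizes, the random hidden states and head weights, and the helper names log_softmax and head_ranks are all assumptions chosen for illustration. With hidden dimension d, the logit matrix collected over many contexts has rank at most d, and the log-probability matrix has rank at most d + 1, however large the vocabulary is.

```python
import numpy as np


def log_softmax(logits):
    """Row-wise log-softmax, shifted by the row maximum for numerical stability."""
    shifted = logits - logits.max(axis=1, keepdims=True)
    return shifted - np.log(np.exp(shifted).sum(axis=1, keepdims=True))


def head_ranks(num_contexts=200, vocab_size=100, hidden_dim=8, seed=0):
    """Return the numerical ranks of the logit and log-probability matrices.

    Hypothetical setup: random hidden states and random head weights stand in
    for a trained model. Only the shapes matter for the rank bound.
    """
    rng = np.random.default_rng(seed)
    hidden = rng.standard_normal((num_contexts, hidden_dim))  # one row per context
    weights = rng.standard_normal((vocab_size, hidden_dim))   # linear prediction head

    logits = hidden @ weights.T       # shape (contexts, vocab), rank <= hidden_dim
    log_probs = log_softmax(logits)   # subtracts one scalar per row: rank <= hidden_dim + 1

    # An explicit absolute tolerance keeps round-off from inflating the rank.
    logit_rank = np.linalg.matrix_rank(logits, tol=1e-8)
    log_prob_rank = np.linalg.matrix_rank(log_probs, tol=1e-8)
    return logit_rank, log_prob_rank


logit_rank, log_prob_rank = head_ranks()
print("rank of logits:", logit_rank)
print("rank of log-probabilities:", log_prob_rank)
```

If the target log-probability matrix has rank well above d + 1, no choice of head weights can reproduce it exactly, which is the mismatch described above.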