Language models can store up to 2 bits of knowledge per parameter, even when quantized to int8, and such knowledge can be flexibly extracted for downstream applications.
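To make the headline figure concrete, here is a minimal back-of-envelope sketch of the storage implied by the 2-bits-per-parameter capacity claim. The model sizes and the `knowledge_capacity` helper below are illustrative assumptions for the arithmetic, not values or code from the source.

```python
# Rough estimate of knowledge capacity implied by "2 bits per parameter".
# Model sizes are hypothetical examples; only the 2 bits/parameter figure
# comes from the claim above.

BITS_PER_PARAM = 2  # claimed upper bound on stored knowledge per parameter


def knowledge_capacity(num_params: int, bits_per_param: float = BITS_PER_PARAM) -> dict:
    """Return the implied knowledge capacity for a model with `num_params` parameters."""
    total_bits = num_params * bits_per_param
    return {
        "params": num_params,
        "bits": total_bits,
        "gigabytes": total_bits / 8 / 1e9,  # raw storage equivalent of that many bits
    }


if __name__ == "__main__":
    for n in (1e9, 7e9, 70e9):  # 1B, 7B, 70B parameters (hypothetical model sizes)
        cap = knowledge_capacity(int(n))
        print(
            f"{cap['params']:.0e} params -> {cap['bits']:.1e} bits "
            f"(~{cap['gigabytes']:.2f} GB of knowledge)"
        )
```

Under these assumptions, a 7B-parameter model would hold on the order of 1.4e10 bits, roughly 1.75 GB of knowledge, regardless of whether the weights are stored in int8.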
Downscaling language complexity during pre-training enables smaller generative language models to exhibit emergent zero-shot learning capabilities comparable to those of larger models trained on unrestricted language.