Transformer Language Models Can Represent n-gram Language Models
Transformer language models with hard or sparse attention can exactly represent any n-gram language model, which establishes a concrete lower bound on transformers' probabilistic representational capacity.
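To make the claim concrete, here is a minimal sketch (not the paper's full construction) of the intuition behind it: with hard attention, head k at position t can attend exactly to position t-k, so the model recovers the previous n-1 symbols and an output layer can map that context to the n-gram conditional distribution. All names below (`VOCAB`, `N`, `ngram_probs`, `hard_attention_ngram_lm`) are illustrative assumptions, and the hard-attention lookups are simulated directly rather than implemented with trained weights.

```python
import numpy as np

VOCAB = ["<bos>", "a", "b"]   # symbol inventory; <bos> pads short contexts
N = 2                          # bigram model, for brevity

# Hypothetical bigram conditional table P(next | previous symbol).
ngram_probs = {
    ("<bos>",): {"a": 0.6, "b": 0.4},
    ("a",):     {"a": 0.1, "b": 0.9},
    ("b",):     {"a": 0.7, "b": 0.3},
}

def hard_attention_ngram_lm(tokens):
    """Score a string the way a hard-attention transformer could:
    head k attends *exactly* to position t-k (hard/sparse attention),
    so position t's representation encodes the full (n-1)-symbol
    context, which an output layer maps to n-gram probabilities."""
    padded = ["<bos>"] * (N - 1) + list(tokens)
    logp = 0.0
    for t in range(N - 1, len(padded)):
        # Each "head" k retrieves the symbol at offset k: a one-hot
        # lookup standing in for argmax (hard) attention over positions.
        context = tuple(padded[t - k] for k in range(N - 1, 0, -1))
        logp += np.log(ngram_probs[context][padded[t]])
    return logp

print(np.exp(hard_attention_ngram_lm("ab")))  # P(a|<bos>) * P(b|a) = 0.54
```

Because each head's target position is a fixed offset, the attention pattern is exact rather than approximate, which is why the representation result holds exactly and not merely in the limit.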