Mapping Self-Attention Mechanisms to a Generalized Potts Model for Efficient Masked Language Modeling
A single layer of factored self-attention can exactly reconstruct the couplings of a generalized Potts model with two-body interactions between both sites and colors, making it an efficient building block for transformer models.
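To make the claim concrete, the following is a minimal sketch (an assumed illustration, not the authors' code) of a single factored self-attention layer: the attention matrix is built from positional parameters only, while one value matrix mixes the colors, so the output at each site is a sum of terms of the form A_ij V x_j, which is exactly the structure of two-body Potts couplings J_ij = A_ij V acting on one-hot color vectors. All names and shapes here (Q, K, V, factored_attention) are illustrative assumptions.

```python
import numpy as np

rng = np.random.default_rng(0)

L, C = 8, 4          # number of sites (sequence length) and colors (vocabulary size)

# Learned parameters (randomly initialised here for illustration):
Q = rng.normal(size=(L, 16))     # positional "query" features, one row per site
K = rng.normal(size=(L, 16))     # positional "key" features, one row per site
V = rng.normal(size=(C, C))      # single value matrix acting on colors

def factored_attention(x_onehot):
    """x_onehot: (L, C) one-hot encoding of a color sequence.
    Returns (L, C) logits for the masked-color prediction at every site."""
    # Attention depends only on positions, not on the input tokens ("factored").
    scores = Q @ K.T / np.sqrt(Q.shape[1])                       # (L, L)
    np.fill_diagonal(scores, -np.inf)                            # a site never attends to itself
    A = np.exp(scores) / np.exp(scores).sum(-1, keepdims=True)   # softmax over sites j
    # Output at site i: sum_j A[i, j] * (V @ x_j), i.e. effective couplings J_ij = A_ij V.
    return A @ (x_onehot @ V.T)

# Example: predict the color at a masked site from all the others.
seq = rng.integers(0, C, size=L)
x = np.eye(C)[seq]                                               # one-hot encode
logits = factored_attention(x)
print("predicted color distribution at site 0:",
      np.exp(logits[0]) / np.exp(logits[0]).sum())
```

Because the attention weights are independent of the input sequence, the learned parameters can be read off directly as the site-site and color-color interaction terms of the underlying Potts model, which is what allows the exact reconstruction of the couplings.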