Localizing Paragraph Memorization in Language Models
Gradients for memorized paragraphs show a distinguishable spatial pattern: they are larger in lower layers than the gradients for non-memorized paragraphs. A specific attention head in the first layer appears to be strongly involved in memorizing rare tokens.
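
A layer-wise gradient comparison of this kind could be checked along the following lines. This is a minimal sketch and not the authors' method. It assumes the gradients have already been computed and exported as a dict mapping each parameter name to a flat list of floats, and that names carry a hypothetical `layers.<index>.` prefix. The function names `layer_grad_norms` and `lower_layer_share` and the toy values are made up for illustration.

```python
import math
import re
from collections import defaultdict

# Hypothetical parameter naming: "layers.<index>.<rest>". Real models differ.
LAYER_PATTERN = re.compile(r"^layers\.(\d+)\.")


def layer_grad_norms(grads):
    """Return {layer_index: L2 norm of all gradient entries in that layer}."""
    squared = defaultdict(float)
    for name, values in grads.items():
        match = LAYER_PATTERN.match(name)
        if match is None:
            continue  # skip parameters outside the layer stack (e.g. embeddings)
        squared[int(match.group(1))] += sum(v * v for v in values)
    return {layer: math.sqrt(total) for layer, total in sorted(squared.items())}


def lower_layer_share(norms, n_lower):
    """Fraction of the summed per-layer norms that falls in the first n_lower layers."""
    total = sum(norms.values())
    if total == 0.0:
        return 0.0
    lower = sum(norm for layer, norm in norms.items() if layer < n_lower)
    return lower / total


# Made-up toy gradients for illustration only; not data from the paper.
memorized = {
    "layers.0.attn.weight": [3.0, 4.0],
    "layers.1.mlp.weight": [0.0, 1.0],
}
non_memorized = {
    "layers.0.attn.weight": [0.0, 1.0],
    "layers.1.mlp.weight": [3.0, 4.0],
}

mem_norms = layer_grad_norms(memorized)
non_norms = layer_grad_norms(non_memorized)
```

Comparing `lower_layer_share(mem_norms, 1)` with `lower_layer_share(non_norms, 1)` then gives a single number per paragraph for how concentrated the gradient is in the lowest layers. Grouping by layer index rather than by individual parameter keeps the comparison independent of how many parameter tensors each layer contains.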