Large language models possess substantial capacity to memorize training examples, influenced by factors such as model size, context length, and how often a sequence is duplicated in the training set. This memorization raises concerns about the fair use of copyrighted data. We propose a definition of memorization based on adversarial compression: a model is considered to have memorized a piece of text if that text can be elicited by a prompt shorter than the text itself.
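The definition above can be sketched as a simple compression ratio over token counts. This is a minimal illustration, not the full method: the function names and the strict-inequality threshold are illustrative assumptions, and the hard part in practice, searching for the shortest prompt that elicits the target (e.g. via prompt optimization), is omitted here.

```python
def adversarial_compression_ratio(target_tokens: int, prompt_tokens: int) -> float:
    """Length of the target text divided by the length of the shortest
    prompt found to elicit it, both measured in tokens. A ratio above 1.0
    means the prompt "compresses" the text."""
    if prompt_tokens <= 0:
        raise ValueError("prompt must contain at least one token")
    return target_tokens / prompt_tokens


def is_memorized(target_tokens: int, prompt_tokens: int) -> bool:
    # Under this definition, the text counts as memorized when the
    # eliciting prompt is strictly shorter than the text itself.
    return adversarial_compression_ratio(target_tokens, prompt_tokens) > 1.0
```

For example, a 100-token passage elicited by a 10-token prompt has a ratio of 10.0 and counts as memorized, while a passage that requires a prompt longer than itself does not.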