Key Concepts
The paper proposes a watermarking framework that embeds signals into language-model-generated text. The signals are invisible to human readers but algorithmically detectable, which makes it possible to identify synthetic content.
Summary
The authors propose a watermarking framework for proprietary language models that can be used to detect machine-generated text. The watermark is embedded with negligible impact on text quality and can be detected using an efficient open-source algorithm without access to the language model API or parameters.
The watermarking approach works by selecting a randomized set of "green" tokens before a word is generated, and then softly promoting the use of green tokens during sampling. A statistical test is proposed for detecting the watermark with interpretable p-values, and an information-theoretic framework is derived for analyzing the sensitivity of the watermark.
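The green-token mechanism described above can be sketched in a few lines. This is a minimal illustration under stated assumptions, not the authors' implementation: the names `green_list` and `watermarked_sample`, the seeding rule (a fixed key multiplied by the previous token id), and the default values `gamma=0.25` and `delta=2.0` are all hypothetical choices made for the example.

```python
import math
import random

def green_list(prev_token, vocab_size, gamma, key=15485863):
    """Pseudo-randomly select a gamma-fraction of the vocabulary as "green".

    The selection is seeded by the previous token (assumed seeding rule),
    so a detector can recompute it later without access to the model.
    """
    rng = random.Random(key * prev_token)
    perm = list(range(vocab_size))
    rng.shuffle(perm)
    return set(perm[: int(gamma * vocab_size)])

def watermarked_sample(logits, prev_token, gamma=0.25, delta=2.0, rng=None):
    """Sample the next token after softly promoting green tokens.

    A constant delta is added to the logit of every green token; the
    boosted logits then go through an ordinary softmax and sampling step.
    """
    rng = rng or random.Random()
    green = green_list(prev_token, len(logits), gamma)
    boosted = [l + delta if i in green else l for i, l in enumerate(logits)]
    m = max(boosted)  # subtract the max for numerical stability
    weights = [math.exp(b - m) for b in boosted]
    return rng.choices(range(len(logits)), weights=weights, k=1)[0]
```

Because the boost is additive on the logits rather than a hard filter, a token that the model strongly prefers can still be chosen even when it is not green. That is the sense in which the promotion is "soft".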
The watermark is tested using a multi-billion parameter model from the Open Pretrained Transformer (OPT) family, and its robustness and security are discussed. The watermark detection algorithm can be made public, enabling third parties to run it, or it can be kept private and run behind an API.
The authors analyze the impact of the watermark on text quality, showing that it has minimal effect on perplexity for high and low entropy text, while moderately impacting text of moderate entropy. They also explore the sensitivity of the watermark detection, demonstrating high true positive rates and low false positive rates, even for short text fragments.
Statistics
By chance alone, the watermarked text would be expected to contain 9 "green" tokens; it actually contains 28. The probability of this happening by random chance is approximately 6×10^-14.
The z-score for the watermarked text is 7.4.
The number of tokens in the watermarked text is 36.
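The figures above can be related through a one-proportion z-test on the green-token count. The sketch below assumes a green-list fraction of 0.25, inferred from the expected count of 9 out of 36 tokens; the function names are hypothetical.

```python
import math

def watermark_z_score(green_count, total_tokens, gamma):
    """z-score of the observed green-token count under the null hypothesis
    that each token is green independently with probability gamma."""
    expected = gamma * total_tokens
    std = math.sqrt(total_tokens * gamma * (1 - gamma))
    return (green_count - expected) / std

def one_sided_p_value(z):
    """Upper-tail probability of a standard normal at z."""
    return 0.5 * math.erfc(z / math.sqrt(2))

# Assumed gamma = 0.25, i.e. 9 expected green tokens out of 36.
z = watermark_z_score(green_count=28, total_tokens=36, gamma=0.25)
p = one_sided_p_value(z)
print(f"z = {z:.2f}, p = {p:.1e}")
```

With these inputs the z-score comes out at about 7.3, close to the reported 7.4; the small gap may reflect rounding in the source.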
Quotes
"The watermark can be algorithmically detected without any knowledge of the model parameters or access to the language model API."
"Watermarked text can be generated using a standard language model without re-training."
"The watermark is detectable from only a contiguous portion of the generated text."