Core Concepts
The authors argue that current LLM watermarking schemes are far more vulnerable than previously thought, introducing watermark stealing as a fundamental threat to existing schemes and highlighting the need for more robust designs.
Abstract
The paper examines the vulnerability of large language model (LLM) watermarking schemes to watermark stealing, in which an attacker queries a watermarked model to approximately learn its watermarking rules. It introduces an automated watermark-stealing algorithm and evaluates the resulting spoofing attacks (forging the watermark on attacker-chosen text) and scrubbing attacks (removing the watermark from model output) against state-of-the-art schemes in realistic settings. The results show that attackers can spoof and scrub watermarks with high success rates, posing a concrete threat to model owners and their clients.
Beyond the core attacks, the study covers key contributions, experimental evaluations, related work, mitigations, and broader impact. It provides detailed insight into the vulnerabilities of current LLM watermarking schemes and underscores the importance of developing secure and reliable watermarking methods.
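To make the threat concrete, here is a minimal sketch of a KGW-style green-list watermark detector. The hash-based vocabulary split and all names below are hypothetical illustrations, not the paper's actual implementation: the point is only that an attacker who has approximately learned which tokens are "green" after each context can deliberately emit green tokens to inflate the detector's z-score (spoofing), or paraphrase around them to deflate it (scrubbing).

```python
import hashlib

def green_list(prev_token: str, vocab: list[str], gamma: float = 0.5) -> set[str]:
    # KGW-style splitting (illustrative): the previous token pseudo-randomly
    # partitions the vocabulary, and a fraction gamma is marked "green".
    scored = sorted(vocab, key=lambda t: hashlib.sha256(
        (prev_token + t).encode()).hexdigest())
    return set(scored[: int(gamma * len(vocab))])

def z_score(tokens: list[str], vocab: list[str], gamma: float = 0.5) -> float:
    # Detector: count tokens that fall in their predecessor's green list,
    # then standardize against the fraction gamma expected by chance.
    hits = sum(tok in green_list(prev, vocab, gamma)
               for prev, tok in zip(tokens, tokens[1:]))
    n = len(tokens) - 1
    return (hits - gamma * n) / (gamma * (1 - gamma) * n) ** 0.5
```

An attacker who reconstructs `green_list` well enough can generate text whose z-score exceeds any detection threshold, which is exactly why watermark stealing enables both attack families studied in the paper.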
Stats
For under $50, an attacker can both spoof and scrub state-of-the-art schemes with an average success rate of over 80%.
Watermark stealing boosts the success rate of scrubbing attacks on KGW2-SELFHASH from almost 0% to over 85%.
Quotes
"We make all our code and additional examples available at https://watermark-stealing.org."
"Our results challenge common beliefs about LLM watermarking, stressing the need for more robust schemes."