Key Concepts
Self Color Testing-based Substitution (SCTS) effectively evades watermark detection by substituting green tokens with non-green ones.
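To make the green/non-green idea concrete, here is a minimal sketch of green-list detection in the style of Kirchenbauer et al. (2023a), together with a toy substitution pass. The hash-based color rule, `VOCAB`, `pick`, `generate_watermarked`, and `substitute_green` are illustrative assumptions, not the paper's implementation. Real schemes color token IDs using a keyed pseudorandom partition of the vocabulary, and a real attack must also preserve meaning, which this toy ignores.

```python
import hashlib
import math

GAMMA = 0.5  # assumed fraction of the vocabulary that is green at each step
VOCAB = [f"w{i}" for i in range(64)]  # hypothetical toy vocabulary


def is_green(prev_token: str, token: str, gamma: float = GAMMA) -> bool:
    """Toy color rule: hash (previous token, token) to a value in [0, 1)."""
    digest = hashlib.sha256(f"{prev_token}|{token}".encode()).digest()
    return int.from_bytes(digest[:8], "big") / 2**64 < gamma


def z_score(tokens: list[str], gamma: float = GAMMA) -> float:
    """One-proportion z-statistic on the number of green tokens."""
    n = len(tokens) - 1  # only tokens with a predecessor are scored
    greens = sum(is_green(p, t) for p, t in zip(tokens, tokens[1:]))
    return (greens - gamma * n) / math.sqrt(n * gamma * (1 - gamma))


def pick(prev_token: str, want_green: bool) -> str:
    """Return the first vocabulary word with the requested color after prev_token."""
    for cand in VOCAB:
        if is_green(prev_token, cand) == want_green:
            return cand
    return VOCAB[0]  # fallback if no word has the requested color


def generate_watermarked(length: int, start: str = "<s>") -> list[str]:
    """Extreme 'hard' watermark: always emit a green token."""
    tokens = [start]
    for _ in range(length):
        tokens.append(pick(tokens[-1], True))
    return tokens


def substitute_green(tokens: list[str]) -> list[str]:
    """Left to right, swap each green token for a non-green one."""
    out = [tokens[0]]
    for tok in tokens[1:]:
        out.append(pick(out[-1], False) if is_green(out[-1], tok) else tok)
    return out


watermarked = generate_watermarked(40)
attacked = substitute_green(watermarked)
print(f"z before: {z_score(watermarked):.2f}, z after: {z_score(attacked):.2f}")
```

The sketch processes tokens left to right because, under this assumed rule, a token's color depends on its predecessor, so each substitution can change the color of the token that follows it.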
Summary
The paper presents SCTS, an attack that evades watermark detection in text generated by large language models through color-aware token substitutions. Its core component, Self Color Testing, prompts the watermarked LLM for seemingly random generations and uses the output to infer token colors. The study compares SCTS with existing attack methods across a range of edit distance budgets and watermarked models. The authors report that SCTS reduces AUROC to below 0.5 on two LLMs and two watermarking schemes while using fewer edits than related work.
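As a reminder of what an AUROC below 0.5 means, the sketch below computes AUROC directly as the probability that a randomly chosen positive (watermarked) sample receives a higher detector score than a randomly chosen negative (unwatermarked) one. The function name `auroc` and all scores are made up for illustration and are not results from the paper.

```python
def auroc(negatives: list[float], positives: list[float]) -> float:
    """Fraction of (positive, negative) pairs ranked correctly; ties count as half."""
    wins = 0.0
    for p in positives:
        for n in negatives:
            if p > n:
                wins += 1.0
            elif p == n:
                wins += 0.5
    return wins / (len(positives) * len(negatives))


# Made-up detector z-scores, for illustration only.
human = [-0.8, 0.1, 0.5, 1.2]        # unwatermarked text
watermarked = [4.9, 6.3, 5.5, 7.1]   # watermarked text before any attack
attacked = [-1.5, -0.9, 0.3, -2.2]   # watermarked text after substitution

print(auroc(human, watermarked))  # 1.0: detector separates the classes perfectly
print(auroc(human, attacked))     # 0.125: attacked text scores below human text
```

An AUROC of 0.5 corresponds to random guessing, so a value below 0.5 means the detector tends to rank attacked watermarked text as less watermarked than ordinary text.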
Structure:
- Introduction to Watermarking Approaches (Kirchenbauer et al., 2023a)
- Existing Attack Methods: Paraphrasing and Prompting Strategies
- Limitations of Current Approaches: Dilution of Watermarks, Ineffectiveness under Edit Constraints
- Proposal of SCTS Attack Method: Self Color Testing-based Substitution Algorithm
- Analysis of SCTS Efficiency: Comparison with Baseline Methods, Semantic Preservation, Accuracy Evaluation
- Impact on Different Watermarked Models: Alignment and Instruction Fine-Tuning Influence on Success Rate
- Discussion on Limitations, Open Questions, and Potential Improvements
- Conclusion on the Effectiveness of SCTS in Evading Watermark Detection
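The Self Color Testing idea in the outline above can be sketched as a frequency test: ask the model to choose between two candidate words uniformly at random, then check whether one word appears significantly more often than the other. This is a minimal sketch under stated assumptions, not the paper's prompt or test statistic. `self_color_test`, `make_simulated_model`, the logit bias `delta`, and the threshold are hypothetical, the LLM is replaced by a simulator, and the dependence of colors on the preceding context is omitted.

```python
import math
import random
from collections import Counter
from typing import Callable, Optional


def self_color_test(
    sample: Callable[[str, str], str],
    a: str,
    b: str,
    n_queries: int = 200,
    z_threshold: float = 4.0,
) -> Optional[str]:
    """Return the candidate inferred to be green, or None if no bias is detected."""
    counts = Counter(sample(a, b) for _ in range(n_queries))
    # Under equal colors, counts[a] is roughly Binomial(n_queries, 0.5).
    z = (counts[a] - n_queries / 2) / math.sqrt(n_queries / 4)
    if z > z_threshold:
        return a
    if z < -z_threshold:
        return b
    return None


def make_simulated_model(green: set[str], delta: float = 2.0, seed: int = 0):
    """Hypothetical stand-in for a watermarked LLM asked to pick a or b at random."""
    rng = random.Random(seed)

    def sample(a: str, b: str) -> str:
        # Assumed soft watermark: green tokens get +delta added to their logit.
        logit_a = delta if a in green else 0.0
        logit_b = delta if b in green else 0.0
        p_a = 1.0 / (1.0 + math.exp(logit_b - logit_a))
        return a if rng.random() < p_a else b

    return sample


model = make_simulated_model(green={"quick"})
print(self_color_test(model, "quick", "fast"))
```

Once a candidate substitute is inferred to be non-green while the original token is green, swapping it in lowers the detector's green count. The open question quoted at the end of this summary concerns how much color information a single query can yield.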
Statistics
"In our experiments, SCTS successfully evades watermark detection using fewer number of edits than related work."
"Our evaluation compares SCTS and existing representative attack methods over a series of edit distance budgets."
"We conclude that across various settings, our approach is superior in reducing AUROC to less than 0.5 on two LLMs and two watermarking schemes."
Quotes
"We propose the first “color-aware” attack method by prompting the LLM for (a seemingly) random generation to obtain color information."
"Can one LLM query get more color information?"