Core Concepts
Expert editing can significantly reduce the detectability of machine-generated text, while LLM-based editing is less effective at evading detection.
Stats
Beemo consists of 6.5k texts written by humans, generated by ten open-source instruction-finetuned LLMs, and edited by expert annotators.
Beemo comprises 13.1k machine-generated & LLM-edited texts.
The average edit percentage by expert annotators is 70%.
The average edit percentage by GPT-4o is 60%.
The average edit percentage by Llama3.1-70B-Instruct is 80%.
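The notes do not spell out how "edit percentage" is computed in Beemo. A common proxy for how much of a text was changed is one minus a sequence-similarity ratio between the original and the edited version; a minimal sketch using Python's standard `difflib` (an illustrative stand-in, not necessarily the paper's exact metric):

```python
import difflib

def edit_percentage(original: str, edited: str) -> float:
    """Rough share of a text changed by editing, as a percentage.

    Uses difflib.SequenceMatcher's similarity ratio; this is an
    illustrative proxy, not Beemo's documented metric.
    """
    ratio = difflib.SequenceMatcher(None, original, edited).ratio()
    return round((1.0 - ratio) * 100, 1)

# Hypothetical example texts, for illustration only.
machine_text = "The quick brown fox jumps over the lazy dog."
edited_text = "A quick brown fox leaped over a sleepy dog."
print(edit_percentage(machine_text, edited_text))
```

An unchanged text scores 0%, a fully rewritten one approaches 100%; under this proxy, the reported figures would mean LLM editors like Llama3.1-70B-Instruct rewrite more of the text than expert annotators do.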
Quotes
"This paper introduces the Benchmark of Expert-edited Machine-generated Outputs (Beemo), which includes 6.5k texts written by humans, generated by ten instruction-finetuned LLMs and edited by expert annotators, who are well-experienced in refining LLM-generated content."
"Our key empirical results demonstrate that detectors can be confused by moderate expert edits, while editing with state-of-the-art LLMs does not significantly influence the detection behavior."
"Furthermore, we find that zero-shot detectors are more generalizable to both expert- and LLM-edited MGTs than pretrained detectors."