Core Concepts
Expert editing can significantly reduce the detectability of machine-generated text, while LLM-based editing is less effective at evading detection.
Stats
Beemo consists of 6.5k texts written by humans, generated by ten open-source instruction-finetuned LLMs, and edited by expert annotators.
Beemo comprises 13.1k machine-generated & LLM-edited texts.
The average edit percentage by expert annotators is 70%.
The average edit percentage by GPT-4o is 60%.
The average edit percentage by Llama3.1-70B-Instruct is 80%.
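The notes do not spell out how "edit percentage" is computed in Beemo. A common proxy for how much of a text was changed is one minus a sequence-similarity ratio between the original and the edited version; a minimal sketch using Python's standard `difflib` (an illustrative stand-in, not necessarily the paper's exact metric):

```python
import difflib

def edit_percentage(original: str, edited: str) -> float:
    """Rough share of a text changed by editing, as a percentage.

    Uses difflib.SequenceMatcher's similarity ratio; this is an
    illustrative proxy, not Beemo's documented metric.
    """
    ratio = difflib.SequenceMatcher(None, original, edited).ratio()
    return round((1.0 - ratio) * 100, 1)

# Hypothetical example texts, for illustration only.
machine_text = "The quick brown fox jumps over the lazy dog."
edited_text = "A quick brown fox leaped over a sleepy dog."
print(edit_percentage(machine_text, edited_text))
```

An unchanged text scores 0%, a fully rewritten one approaches 100%; under this proxy, the reported figures would mean LLM editors like Llama3.1-70B-Instruct rewrite more of the text than expert annotators do.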
Quotes
"This paper introduces the Benchmark of Expert-edited Machine-generated Outputs (Beemo), which includes 6.5k texts written by humans, generated by ten instruction-finetuned LLMs and edited by expert annotators, who are well-experienced in refining LLM-generated content."
"Our key empirical results demonstrate that detectors can be confused by moderate expert edits, while editing with state-of-the-art LLMs does not significantly influence the detection behavior."
"Furthermore, we find that zero-shot detectors are more generalizable to both expert- and LLM-edited MGTs than pretrained detectors."