Core Concept
Combining linguistic features and language model embeddings can effectively distinguish machine-generated text from human-written text, even across unseen language models and domains.
Summary
The paper presents a system for detecting machine-generated text (MGT) for SemEval-2024 Task 8, "Multigenerator, Multidomain, and Multilingual Black-Box Machine-Generated Text Detection". The authors investigate how several types of linguistic features (text statistics, readability, stylometry, lexical diversity, rhetorical structure, and entity grid) affect detection. Combining embeddings from a fine-tuned RoBERTa-base model with lexical diversity features gives the best performance and outperforms a competitive baseline. The authors also observe that a model relying solely on linguistic features, such as stylometry and entity grid, can perform on par with the baseline. In addition, they stress careful selection of the training data: using MGTs from all domains and human-written texts (HWTs) only from the WikiHow domain improves performance. The results indicate that the approach generalizes, as it achieves high accuracy on unseen language models and domains.
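The feature combination described above can be illustrated with a minimal sketch. It assumes the embedding is already available as a plain list of floats (in the paper it comes from a fine-tuned RoBERTa-base model, which is not reproduced here), and it uses the type-token ratio and root type-token ratio as stand-in diversity measures; the summary does not say which diversity measures the authors used, so both the measures and the function names are illustrative assumptions.

```python
import re


def lexical_diversity(text):
    """Return two simple lexical diversity measures for a text.

    Assumption: TTR and root TTR are illustrative choices, not
    necessarily the measures used in the paper.
    """
    tokens = re.findall(r"[a-z']+", text.lower())
    if not tokens:
        return {"ttr": 0.0, "root_ttr": 0.0}
    types = set(tokens)
    return {
        # Type-token ratio: unique words divided by total words.
        "ttr": len(types) / len(tokens),
        # Root TTR: unique words divided by the square root of total words.
        "root_ttr": len(types) / len(tokens) ** 0.5,
    }


def combine_features(embedding, text):
    """Concatenate a precomputed embedding with diversity features.

    `embedding` is a hypothetical placeholder for a sentence or
    document vector produced by a language model.
    """
    diversity = lexical_diversity(text)
    return list(embedding) + [diversity["ttr"], diversity["root_ttr"]]
```

The combined vector would then be passed to whatever classifier sits on top; the choice of classifier is left open here because the summary does not describe it.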
Statistics
The number of difficult words (words with more than two syllables that are not in the list of easy words) is lower in HWTs than in MGTs across all language models.
The raw lexicon count (unique words) and raw sentence count are higher in HWTs than in MGTs across all language models.
The Flesch Reading Ease Test, Flesch-Kincaid Grade Level Test, and Linsear Write Metric indicate that HWTs are generally more readable than MGTs.
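For readers who want to see how statistics of this kind are computed, the following is a rough sketch using the standard Flesch Reading Ease and Flesch-Kincaid Grade Level formulas. The syllable counter is a crude vowel-group heuristic, the `easy_words` set is a hypothetical placeholder for the easy-word list mentioned above, and none of this is claimed to match the tooling the authors used.

```python
import re

VOWEL_GROUP = re.compile(r"[aeiouy]+")


def count_syllables(word):
    """Rough syllable estimate: number of vowel groups, at least 1."""
    return max(1, len(VOWEL_GROUP.findall(word.lower())))


def text_statistics(text, easy_words=frozenset()):
    """Compute simple text statistics and two readability scores.

    `easy_words` is a hypothetical set of lowercase words that should
    never count as difficult.
    """
    sentences = [s for s in re.split(r"[.!?]+", text) if s.strip()]
    words = re.findall(r"[A-Za-z']+", text)
    if not sentences or not words:
        raise ValueError("text must contain at least one word")

    syllables = sum(count_syllables(w) for w in words)
    words_per_sentence = len(words) / len(sentences)
    syllables_per_word = syllables / len(words)

    # Difficult words: more than two syllables and not in the easy list.
    difficult = sum(
        1
        for w in words
        if count_syllables(w) > 2 and w.lower() not in easy_words
    )

    return {
        "word_count": len(words),
        "sentence_count": len(sentences),
        "difficult_words": difficult,
        # Higher Flesch Reading Ease means easier to read.
        "flesch_reading_ease": 206.835
        - 1.015 * words_per_sentence
        - 84.6 * syllables_per_word,
        # Flesch-Kincaid approximates a US school grade level.
        "flesch_kincaid_grade": 0.39 * words_per_sentence
        + 11.8 * syllables_per_word
        - 15.59,
    }
```

Calling `text_statistics` on an HWT and an MGT of similar length gives directly comparable feature values, though a dictionary-based syllable counter would be more accurate than the heuristic used here.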
Quotes
"Our results suggest that our best model, which uses diversity features and embeddings, outperforms a very competitive baseline introduced in this task (Wang et al., 2024), yielding an accuracy of 0.95 on the development and 0.91 on the test set."
"It is the only feature type that increases the accuracy obtained with embeddings only."
"Stylometry features turn out to be the best linguistic feature type when used on their own: the accuracy with sty is 0.68 vs. 0.6 with feat."