The paper presents a system for detecting machine-generated text (MGT), submitted to SemEval-2024 Task 8, "Multigenerator, Multidomain, and Multilingual Black-Box Machine-Generated Text Detection". The authors investigate how several groups of linguistic features affect detection: text statistics, readability, stylometry, lexical diversity, rhetorical structure, and entity grid. They find that combining embeddings from a fine-tuned RoBERTa-base model with lexical diversity features achieves the best performance and outperforms a competitive baseline. They also observe that a model relying solely on linguistic features, such as stylometry and entity grid, can perform on par with the baseline. In addition, they stress the importance of careful training-data selection: using MGTs from all domains but human-written texts (HWTs) only from the WikiHow domain leads to improved performance. The results indicate that the proposed approach generalizes, as it achieves high accuracy on unseen language models and domains.
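To make the idea of pairing embeddings with lexical diversity features concrete, here is a minimal sketch. It is not the authors' implementation: the two features (type-token ratio and hapax ratio), the tokenizer, and the function names `lexical_diversity_features` and `combine` are assumptions chosen for illustration, and the embedding is a placeholder vector rather than the output of a fine-tuned RoBERTa-base model.

```python
import re
from collections import Counter
from typing import List, Sequence


def lexical_diversity_features(text: str) -> List[float]:
    """Return [type-token ratio, hapax ratio] for a text.

    Both measures are illustrative assumptions; the summary does not say
    which lexical diversity measures the paper uses.
    """
    # Naive lowercase word tokenizer (assumption, not the paper's tokenizer).
    tokens = re.findall(r"[a-z']+", text.lower())
    if not tokens:
        return [0.0, 0.0]
    counts = Counter(tokens)
    # Type-token ratio: distinct words divided by total words.
    ttr = len(counts) / len(tokens)
    # Hapax ratio: words occurring exactly once, divided by total words.
    hapax = sum(1 for c in counts.values() if c == 1) / len(tokens)
    return [ttr, hapax]


def combine(embedding: Sequence[float], text: str) -> List[float]:
    """Concatenate a text embedding with the lexical diversity features."""
    return list(embedding) + lexical_diversity_features(text)


# Placeholder embedding; in practice this would come from an encoder.
features = combine([0.1, 0.2], "the cat sat on the mat")
```

The combined vector could then be passed to any downstream classifier; which classifier the paper uses is not stated in this summary.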
Key insights distilled from
by Kseniia Petu... at arxiv.org 04-09-2024
https://arxiv.org/pdf/2404.05483.pdf