The paper presents a system for detecting machine-generated text (MGT) in the SemEval-2024 Task 8 on "Multigenerator, Multidomain, and Multilingual Black-Box Machine-Generated Text Detection". The authors investigate the impact of various linguistic features, including text statistics, readability, stylometry, lexical diversity, rhetorical structure, and entity grid, on the detection task. They find that a combination of embeddings from a fine-tuned RoBERTa-base model and lexical diversity features achieves the best performance, outperforming a competitive baseline. The authors also observe that a model relying solely on linguistic features, such as stylometry and entity grid, can perform on par with the baseline. Additionally, the authors discuss the importance of careful selection of the training data, noting that using MGTs from all domains and human-written texts (HWTs) only from the WikiHow domain leads to improved performance. The results demonstrate the generalizability of the proposed approach, as it achieves high accuracy on unseen language models and domains.
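The winning combination described above — transformer embeddings concatenated with lexical diversity features — can be sketched as follows. This is a minimal illustration, not the authors' implementation: the specific diversity features (type-token ratio, hapax ratio, Guiraud's root TTR) and the `embedding` argument standing in for the fine-tuned RoBERTa-base vector are assumptions for the sake of the example.

```python
import re

def lexical_diversity_features(text: str) -> list[float]:
    """Simple lexical diversity features of the kind combined with
    RoBERTa embeddings (exact feature set is an assumption)."""
    tokens = re.findall(r"[a-zA-Z']+", text.lower())
    n = len(tokens)
    if n == 0:
        return [0.0, 0.0, 0.0]
    types = set(tokens)
    ttr = len(types) / n                                        # type-token ratio
    hapax = sum(1 for t in types if tokens.count(t) == 1) / n   # hapax legomena ratio
    root_ttr = len(types) / (n ** 0.5)                          # Guiraud's root TTR
    return [ttr, hapax, root_ttr]

def combined_features(text: str, embedding: list[float]) -> list[float]:
    # Concatenate a (hypothetical) fine-tuned RoBERTa embedding with the
    # hand-crafted lexical diversity features before feeding a classifier.
    return embedding + lexical_diversity_features(text)
```

The concatenated vector would then be passed to any standard binary classifier (MGT vs. HWT); the choice of classifier head is left open here.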