Core Concept
A robust and accurate system for detecting machine-generated text from multiple generators across different domains.
Summary
The paper presents a systematic study on detecting machine-generated text from multiple generators and domains for SemEval-2024 Task 8. The key highlights are:
- Data Processing:
  - Merged the training data from Subtask A and Subtask B into a unified dataset, removing duplicates and ensuring all texts are labeled according to the Subtask B scheme (see the merge sketch after this list).
  - Analyzed the token-length distribution of the training and development sets, and tested different input sizes for the Longformer model.
- Fine-tuning Transformer-based Models:
  - Explored encoder-only, decoder-only, and encoder-decoder transformer-based models, including RoBERTa-large, DeBERTa-large, Longformer, XLNet-large, and T5 (a fine-tuning sketch follows this list).
  - Identified that encoder-only models, particularly RoBERTa-large, performed exceptionally well on this task.
- Class-balanced Weighted Loss:
  - Addressed the imbalance across classes by employing a weighted cross-entropy loss function (a weight-computation sketch follows the class counts under Statistics).
- Soft Voting Ensemble:
  - Developed a soft voting ensemble that combines the predictions of multiple base models, leveraging their individual strengths to improve robustness and generalization (see the voting sketch after this list).
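As a concrete illustration of the data-processing step, here is a minimal merge-and-relabel sketch. The file names, the "model" field, and the generator strings are assumptions about the data layout, not details taken from the paper:

```python
import pandas as pd

# Hypothetical file names; each record is assumed to carry a "text" field
# and a generator name in a "model" field.
subtask_a = pd.read_json("subtask_a_train.jsonl", lines=True)
subtask_b = pd.read_json("subtask_b_train.jsonl", lines=True)

# Map generator names onto the six Subtask B classes (names are assumed).
LABELS = {"human": 0, "chatGPT": 1, "cohere": 2,
          "davinci": 3, "bloomz": 4, "dolly": 5}

merged = pd.concat([subtask_a, subtask_b], ignore_index=True)
merged = merged.drop_duplicates(subset="text")     # remove duplicate texts
merged["label"] = merged["model"].map(LABELS)      # relabel per Subtask B
```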
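A fine-tuning sketch for the best-performing base model follows, using the Hugging Face Trainer. The hyperparameters (epochs, batch size, max length, split size) are placeholders rather than the paper's settings, and `merged` refers to the DataFrame from the previous sketch:

```python
from datasets import Dataset
from transformers import (AutoModelForSequenceClassification, AutoTokenizer,
                          Trainer, TrainingArguments)

tokenizer = AutoTokenizer.from_pretrained("roberta-large")
model = AutoModelForSequenceClassification.from_pretrained(
    "roberta-large", num_labels=6)  # C0 (human) + five generators

def tokenize(batch):
    return tokenizer(batch["text"], truncation=True, max_length=512)

# Build train/dev splits from the merged data of the previous sketch
# (the shared task ships a separate dev set; this split is illustrative).
splits = Dataset.from_pandas(merged).train_test_split(test_size=0.05)
train_ds, dev_ds = splits["train"], splits["test"]

args = TrainingArguments(output_dir="roberta-detector",
                         num_train_epochs=3,
                         per_device_train_batch_size=16,
                         evaluation_strategy="epoch")

trainer = Trainer(model=model, args=args, tokenizer=tokenizer,
                  train_dataset=train_ds.map(tokenize, batched=True),
                  eval_dataset=dev_ds.map(tokenize, batched=True))
trainer.train()
```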
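Finally, a soft-voting sketch: average the class probabilities of the base models and take the argmax. Equal model weights are assumed here; the paper may weight models by their individual performance:

```python
import torch
import torch.nn.functional as F

def soft_vote(logits_per_model):
    """Average softmax probabilities across base models (equal weights)
    and return the class with the highest mean probability."""
    probs = [F.softmax(logits, dim=-1) for logits in logits_per_model]
    return torch.stack(probs).mean(dim=0).argmax(dim=-1)

# Example: three base models, a batch of 4 texts, 6 classes.
preds = soft_vote([torch.randn(4, 6) for _ in range(3)])
```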
The proposed system achieved state-of-the-art performance on Subtask B, ranking first on the final test set.
Statistics
The training data consists of 127,755 items with the following class distribution:
- C0 (human-written): 63,351
- C1 (ChatGPT): 13,839
- C2 (Cohere): 13,178
- C3 (Davinci): 13,843
- C4 (BLOOMZ): 9,998
- C5 (Dolly): 13,546

The development set contains 3,000 items.
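These class counts are what the class-balanced weighted loss from the summary compensates for. A minimal sketch, assuming inverse-frequency weights rescaled to average 1 (the paper's exact weighting formula may differ):

```python
import torch
import torch.nn as nn

# Class counts C0..C5 from the training data above.
counts = torch.tensor([63351., 13839., 13178., 13843., 9998., 13546.])

# Inverse-frequency weights, rescaled to mean 1: the majority human class
# is down-weighted, the smaller generator classes up-weighted.
weights = counts.sum() / (len(counts) * counts)

criterion = nn.CrossEntropyLoss(weight=weights)
# loss = criterion(logits, labels)  # logits: (batch, 6), labels: (batch,)
```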
Quotes
"To prevent the misuse of LLMs and improve the iterative refinement of AI tools, it is crucial to distinguish between machine-generated and human-written text."
"Our system formulated a SOTA benchmark on the task."