Basic Concepts
Large Language Models (LLMs) raise concerns about misinformation and personal information leakage, prompting the need for effective machine-generated text detection techniques.
Summary
Abstract:
Large Language Models (LLMs) generate content across various domains.
They raise concerns about misinformation and personal information leakage.
Methods are presented for SemEval-2024 Task 8 to detect machine-generated text.
Introduction:
LLMs pose challenges like fake news and plagiarism.
Identifying machine-generated text is complex due to similarities with human-written text.
Essence of LLM-generated text detection:
LLMs can generate misinformation with legal and ethical implications.
Concerns in healthcare, public safety, education, finance, and intellectual property rights.
Tasks:
Subtask A: Binary classification of human-written vs. machine-generated text.
Subtask B: Multi-class classification of different sources of generation.
Related Work:
Statistical methods and neural networks used for detecting LLM-generated text.
Watermarking techniques employed for intellectual property protection.
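As an illustration of the watermarking idea (a generic green-list scheme in the style of recent LLM watermarking work, not necessarily the exact technique surveyed in the paper), detection can be framed as a hypothesis test: hash each token pair to decide pseudo-randomly whether a token is "green", then check whether the green fraction is suspiciously above chance. The hashing scheme and gamma parameter below are illustrative assumptions.

```python
import hashlib
import math

def is_green(prev_token: str, token: str, gamma: float = 0.5) -> bool:
    """Hash the (previous, current) token pair; the first digest byte
    pseudo-randomly assigns the token to the 'green list' with prob. gamma."""
    digest = hashlib.sha256(f"{prev_token}|{token}".encode()).digest()
    return (digest[0] / 255.0) < gamma

def watermark_z_score(tokens: list[str], gamma: float = 0.5) -> float:
    """z-score of the observed green-token count against the gamma*n
    expected without a watermark; large positive values suggest the
    text was generated with a green-list watermark."""
    hits = sum(is_green(p, t, gamma) for p, t in zip(tokens, tokens[1:]))
    n = len(tokens) - 1
    return (hits - gamma * n) / math.sqrt(n * gamma * (1 - gamma))
```

Unwatermarked text should yield a z-score near zero, while a watermarked generator that preferentially samples green tokens drives the score strongly positive.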
Datasets:
Dataset provided includes human-written and machine-generated text from various sources.
Exploratory data analysis reveals variations in sentence length and token count.
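A minimal sketch of how such sentence-length and token-count statistics could be computed over a text collection (whitespace tokenization and punctuation-based sentence splitting are simplifying assumptions; the paper's exact EDA pipeline is not specified here):

```python
import re
from statistics import mean

def text_stats(texts: list[str]) -> dict:
    """Average token count per document and average sentence length
    (in tokens) across a collection of texts."""
    token_counts, sent_lens = [], []
    for text in texts:
        # Naive sentence split on terminal punctuation.
        sentences = [s for s in re.split(r"[.!?]+\s*", text) if s]
        token_counts.append(len(text.split()))
        sent_lens.append(mean(len(s.split()) for s in sentences))
    return {
        "avg_tokens": mean(token_counts),
        "avg_sentence_len": mean(sent_lens),
    }
```

Comparing these statistics between the human-written and machine-generated subsets is one way such variations in sentence length and token count become visible.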
System Overview:
Statistical, neural, and pre-trained models utilized for machine-generated text identification.
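One simple way to combine such heterogeneous models into an ensemble is hard majority voting over per-model predictions; the sketch below assumes each model has already produced a label list (the label names and tie-breaking rule are illustrative, not the paper's specified scheme):

```python
from collections import Counter

def majority_vote(predictions: list[list[str]]) -> list[str]:
    """predictions: one label list per model, aligned by example.
    Returns the per-example majority label (ties broken by the
    first label seen, per Counter's insertion ordering)."""
    return [
        Counter(labels).most_common(1)[0][0]
        for labels in zip(*predictions)
    ]
```

For example, with three models voting on two examples, `majority_vote([["human", "machine"], ["machine", "machine"], ["human", "human"]])` returns `["human", "machine"]`.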
Results and Analysis:
Subtask A Mono-Lingual:
Ensemble models outperform statistical models on test set accuracy.
Subtask A Multi-Lingual:
BERT Multilingual Base model shows improved accuracy on test set after fine-tuning.
Subtask B:
RoBERTa Base OpenAI Detector achieves high accuracy on test set for multi-class classification task.
Conclusions:
Ensemble models are effective for mono-lingual data classification, while models trained to detect GPT-2 text (e.g., the RoBERTa Base OpenAI Detector) excel in the multi-class classification task.
Limitations:
Computational constraints limited experiments with large language models, affecting generalization from development to test sets.
Statistics
Our methods obtain an accuracy of 86.9% on the test set of Subtask A (mono-lingual) and 83.7% for Subtask B.
We secured 24th rank out of 137 participants.
We observed that statistical models that performed modestly on the development set generalized effectively to the test set.
Some pre-trained language models struggled to generalize on the test set due to differing sources of training and development sets compared to the test set.
The ensemble approach obtains 70.8% accuracy on the development set and 65% accuracy on the test set.
RoBERTa Base OpenAI Detector gave 75.3% on the development set and 83.7% accuracy on the test set.
DistilRoBERTa base obtains 73.3% accuracy on the development set and 79.1% accuracy on the test set securing 17th rank out of 86 participants.