Core Concepts
Automated systems that reliably distinguish human-written from machine-generated text are crucial for detecting and mitigating the misuse of machine-generated content.
1. Abstract:
The MasonTigers entry at SemEval-2024 Task 8 addressed multigenerator, multidomain, and multilingual black-box machine-generated text detection.
The team used ensembles of transformer models, sentence transformers, and statistical machine learning approaches.
2. Introduction:
Large language models such as GPT-3.5 raise concerns about the potential misuse of machine-generated content.
3. Related Work:
Prior studies highlight the difficulty of accurately detecting machine-generated text.
4. Datasets:
Data were collected from sources including Wikipedia, Reddit, and arXiv, spanning multiple languages.
5. Experimental Setup:
Data preprocessing removed special characters and hyperlinks while preserving punctuation marks.
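The preprocessing step described above can be sketched with regular expressions; the exact character set kept by the authors is not specified, so the punctuation whitelist here is an assumption:

```python
import re

def preprocess(text: str) -> str:
    """Remove hyperlinks and special characters while keeping punctuation.

    The retained punctuation set below is illustrative, not the paper's
    exact preprocessing configuration.
    """
    # Strip URLs first so their symbols are not caught by the filter below.
    text = re.sub(r"https?://\S+|www\.\S+", "", text)
    # Keep letters, digits, whitespace, and common punctuation; drop the rest.
    text = re.sub(r"[^\w\s.,!?;:'\"()-]", "", text)
    # Collapse runs of whitespace left behind by the removals.
    return re.sub(r"\s+", " ", text).strip()

# Example: the URL and stray symbols are removed, punctuation survives.
cleaned = preprocess("Check https://example.com now! @#$ Great.")
# → "Check now! Great."
```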
6. Results:
Models achieved varying accuracies across Subtasks A, B, and C, with ensemble methods proving most effective.
7. Error Analysis:
The models performed well overall but produced false positives and other misclassifications when distinguishing human-written from machine-generated text.
8. Conclusion:
Ensemble strategies built on transformer models proved effective at handling the complexities of detecting machine-generated content.
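A weighted ensemble of classifier outputs can be sketched as a weighted average of per-model class probabilities; the weights and toy probabilities below are illustrative, not the paper's actual configuration:

```python
import numpy as np

def weighted_ensemble(prob_list, weights):
    """Combine per-model class-probability arrays with a weighted average.

    prob_list: list of (n_samples, n_classes) arrays, one per model.
    weights:   one non-negative weight per model; normalized internally.
    """
    w = np.asarray(weights, dtype=float)
    w = w / w.sum()                          # normalize weights to sum to 1
    stacked = np.stack(prob_list)            # (n_models, n_samples, n_classes)
    avg = np.tensordot(w, stacked, axes=1)   # weighted average over models
    return avg.argmax(axis=1)                # predicted class per sample

# Illustrative: two models disagree on the second sample; the higher-weighted
# model's prediction wins after averaging.
p1 = np.array([[0.9, 0.1], [0.4, 0.6]])
p2 = np.array([[0.8, 0.2], [0.7, 0.3]])
preds = weighted_ensemble([p1, p2], weights=[0.6, 0.4])
```

Averaging probabilities (a soft vote) lets a confident model outweigh an uncertain one, which is one common reason weighted ensembles beat individual classifiers.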
Stats
"Ensemble methods outperform individual models significantly."
"Our weighted ensemble approaches achieve accuracies of 74%, 60%, and 65%."
"RoBERTa demonstrates superior accuracy compared to DistilBERT."
"ELECTRA outperforms RoBERTa and DistilBERT."
"DeBERTa-v3 excels in predicting chatGPT-generated texts."