Core Concepts
This paper aims to mitigate the template-based translation problem in the Egyptian Arabic Wikipedia. It identifies the template-translated articles and their characteristics through exploratory analysis, then builds automatic detection systems for them.
Abstract
The paper explores the content of the three Arabic Wikipedia editions (Arabic Wikipedia, Egyptian Arabic Wikipedia, and Moroccan Arabic Wikipedia) in terms of density, quality, and human contributions. It highlights how the template-based translation carried out on the Egyptian Wikipedia produced unrepresentative content.
The key highlights and insights from the exploratory analysis are:
The Egyptian Wikipedia has a greater number of total articles than the Arabic Wikipedia, but a substantial portion of these articles (46%) are under 50 tokens, indicating limited and shallow content.
The Egyptian Wikipedia exhibits lower lexical richness and diversity compared to the other Arabic Wikipedia editions, suggesting the template translation produced poor-quality content.
The Egyptian Wikipedia has a high number of duplicate n-grams, especially for n>=5, indicating the use of templates in the translation process.
All articles in the Egyptian Wikipedia were created by human user accounts, yet 42.72% of them were automatically template-translated from English without human supervision.
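The duplicate n-gram signal described above can be illustrated with a minimal sketch. It assumes whitespace tokenization and measures duplication as the share of n-gram occurrences whose n-gram appears more than once in the corpus. The function names, the measure itself, and the English toy sentences are illustrative assumptions, not the paper's procedure or data.

```python
from collections import Counter


def ngrams(tokens, n):
    """Return all contiguous n-grams of a token list as tuples."""
    return [tuple(tokens[i:i + n]) for i in range(len(tokens) - n + 1)]


def duplicate_ngram_ratio(articles, n=5):
    """Share of n-gram occurrences whose n-gram occurs more than once.

    Assumption: plain whitespace tokenization; the paper's exact
    tokenizer and duplication measure are not given in this summary.
    """
    counts = Counter()
    for text in articles:
        counts.update(ngrams(text.split(), n))
    total = sum(counts.values())
    if total == 0:
        return 0.0
    repeated = sum(c for c in counts.values() if c > 1)
    return repeated / total


# Invented toy corpora: two articles filled from one template versus
# two freely written articles.
templated = [
    "X is a village in region A of country B",
    "Y is a village in region A of country B",
]
free_form = [
    "the old market opens before sunrise every friday morning",
    "my grandmother tells stories about the river in summer",
]

print(round(duplicate_ngram_ratio(templated, n=5), 3))  # 0.833
print(duplicate_ngram_ratio(free_form, n=5))            # 0.0
```

In this toy case, only the 5-gram containing the article-specific first word differs between the two templated articles, so most 5-gram occurrences are repeats. The freely written pair shares none.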
The paper then builds multivariate machine learning classifiers that use article metadata to detect the template-translated articles automatically. The key findings are:
Supervised classification algorithms, particularly ensemble methods like Random Forest and XGBoost, outperform unsupervised clustering algorithms in detecting the template-translated articles.
The metadata features related to the total edits, total editors, total bytes, total characters, and total words are effective in training the classifiers to identify the template-translated articles.
The paper publicly deploys the best-performing classifier, XGBoost, as an online application called "Egyptian Wikipedia Scanner" and releases the extracted, filtered, and labeled datasets to the research community.
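To make the idea of classifying articles from metadata concrete, the sketch below fits a single decision stump (a one-split decision tree) on a few metadata fields. This is a deliberately tiny stand-in and not the paper's Random Forest or XGBoost models. The rows are invented, and the rule direction (low values indicate a template-translated article) is an assumption made for illustration only.

```python
def fit_stump(rows, labels, features):
    """Pick the (feature, threshold) whose rule
    `row[feature] <= threshold -> template-translated`
    makes the fewest mistakes on the labeled rows."""
    best = None
    for feature in features:
        for threshold in sorted({row[feature] for row in rows}):
            errors = sum(
                (row[feature] <= threshold) != label
                for row, label in zip(rows, labels)
            )
            if best is None or errors < best[0]:
                best = (errors, feature, threshold)
    _, feature, threshold = best
    return feature, threshold


def predict(stump, row):
    """Apply a fitted stump to one article's metadata."""
    feature, threshold = stump
    return row[feature] <= threshold


# Hypothetical metadata rows; True means "template-translated".
rows = [
    {"total_edits": 1, "total_editors": 1, "total_words": 30},
    {"total_edits": 2, "total_editors": 1, "total_words": 45},
    {"total_edits": 12, "total_editors": 4, "total_words": 600},
    {"total_edits": 7, "total_editors": 3, "total_words": 250},
]
labels = [True, True, False, False]

stump = fit_stump(rows, labels, ["total_edits", "total_editors", "total_words"])
print(stump)
print(predict(stump, {"total_edits": 1, "total_editors": 1, "total_words": 20}))
```

Tree ensembles combine many such splits across features, which is one reason they can exploit several metadata fields at once where a single threshold cannot.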
Finally, the paper discusses the negative implications of the template-based translations on the Egyptian Wikipedia, including societal, representation, and performance issues. It argues that such practices could misrepresent the native speakers and their culture and degrade the performance of language models and NLP systems trained on these corpora.
Stats
The Egyptian Arabic Wikipedia edition has nearly 1.6 million total articles, of which 46% (741K) are under 50 tokens per article.
Among the three editions, the Egyptian Arabic Wikipedia has the lowest mean values of total characters (610) and total tokens/words (100) per article.
The Egyptian Arabic Wikipedia has a high number of duplicate n-grams, especially for n>=5, indicating the use of templates in the translation process.
42.72% of the articles in the Egyptian Arabic Wikipedia are automatically template-translated from English without human supervision.
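Density figures of the kind listed above (mean tokens per article, share of articles under 50 tokens) can be computed along the following lines. This is a minimal sketch assuming whitespace tokenization. The type-token ratio is included as one common proxy for lexical richness and is an assumption here, since this summary does not say which richness measure the paper uses.

```python
def token_stats(articles, short_threshold=50):
    """Mean tokens per article and share of articles below the threshold.

    Assumption: tokens are whitespace-separated strings.
    """
    if not articles:
        raise ValueError("need at least one article")
    lengths = [len(text.split()) for text in articles]
    short = sum(1 for n in lengths if n < short_threshold)
    return {
        "articles": len(lengths),
        "mean_tokens": sum(lengths) / len(lengths),
        "short_share": short / len(lengths),
    }


def type_token_ratio(articles):
    """Distinct tokens divided by total tokens across the corpus."""
    tokens = [tok for text in articles for tok in text.split()]
    return len(set(tokens)) / len(tokens) if tokens else 0.0


# Invented toy corpus: one 3-token stub and one 60-token article.
corpus = ["a b c", " ".join(["a"] * 60)]
stats = token_stats(corpus)
print(stats["mean_tokens"], stats["short_share"])
print(type_token_ratio(corpus))
```

A corpus dominated by short, repetitive stubs pushes the mean token count and the type-token ratio down and the short-article share up, which matches the pattern the summary reports for the Egyptian edition.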
Quotes
"We first explore the content of the three Arabic Wikipedia editions in terms of density, quality, and human contributions and utilize the resulting insights to build multivariate machine learning classifiers leveraging articles' metadata to detect the template-translated articles automatically."
"We then publicly deploy and host the best-performing classifier, XGBoost, as an online application called Egyptian Wikipedia Scanner and release the extracted, filtered, and labeled datasets to the research community to benefit from our datasets and the online, web-based detection system."
"We argue that such automatic template-based translations without humans in the loop could misrepresent the Egyptian Arabic native speakers, where instead of the Egyptian people enriching the content of Wikipedia by sharing their voices, opinions, knowledge, perspectives, and experiences, a couple of registered users automated the creation and translation of more than a million and a half million articles (95.56%) from English on their behalf without supervision or revision of the translated articles."