Core Concepts
A novel computational framework for modeling the quality of Wikipedia articles across languages using language-agnostic structural features.
Abstract
The paper presents a framework for modeling the quality of Wikipedia articles across different language editions using language-agnostic features. The key highlights are:
The framework is based on six language-agnostic structural features extracted from the Wikitext markup of articles: page length and the numbers of references, sections, wikilinks, categories, and media files.
The framework assesses article quality with a heuristic that combines universal feature weights with a normalization criterion derived from each language edition.
The authors apply the framework to the full revision history of articles across all Wikipedia language editions, generating datasets of feature values and predicted quality scores.
The authors evaluate their framework by comparing the predicted quality scores against ground-truth labels from the English and French Wikipedia. They also benchmark their approach against the ORES quality prediction system and a Random Forest model.
The datasets generated from this work are made publicly available to support diverse research on Wikipedia content across languages, including tasks like analyzing content gaps, measuring article reliability, and quantifying the impact of edits on quality.
The authors discuss the ethical and FAIR considerations in releasing these datasets to ensure accessibility and reusability.
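The feature-and-score pipeline described above can be sketched as follows. This is a minimal illustration, not the authors' implementation: the regex-based extractors, the per-feature thresholds, and the uniform weights are all hypothetical stand-ins (the paper derives the normalization criterion from each language edition and uses its own universal weights, and real pipelines parse Wikitext with a proper parser such as mwparserfromhell rather than regexes).

```python
import re

# Hypothetical extractors for the six structural features named in the paper.
# Real Wikitext parsing is more involved; these regexes only approximate it.
def extract_features(wikitext: str) -> dict:
    return {
        "page_length": len(wikitext),
        "refs": len(re.findall(r"<ref[ >]", wikitext)),
        "sections": len(re.findall(r"(?m)^==+[^=].*?==+\s*$", wikitext)),
        "wikilinks": len(re.findall(r"\[\[(?!(?:Category|File|Image):)", wikitext)),
        "categories": len(re.findall(r"\[\[Category:", wikitext)),
        "media": len(re.findall(r"\[\[(?:File|Image):", wikitext)),
    }

# Hypothetical normalization thresholds (the paper derives these per language
# edition) and hypothetical uniform weights standing in for the universal ones.
THRESHOLDS = {"page_length": 10_000, "refs": 20, "sections": 10,
              "wikilinks": 100, "categories": 5, "media": 5}
WEIGHTS = {name: 1 / 6 for name in THRESHOLDS}

def quality_score(features: dict) -> float:
    # Cap each feature at its threshold, scale it to [0, 1],
    # then take the weighted sum, yielding a score in [0, 1].
    return sum(WEIGHTS[name] * min(features[name] / THRESHOLDS[name], 1.0)
               for name in THRESHOLDS)
```

Because every normalized feature lies in [0, 1] and the weights sum to 1, the resulting score is directly comparable across articles within a language edition.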
Stats
"Wikipedia is not only one of the most popular websites but also one of the largest free knowledge repositories in the world."
"Millions of people access Wikipedia daily in search of information on a multitude of topics."
"Several search engines and AI-powered services rely on data extracted from Wikipedia articles."
Quotes
"Wikipedia is the largest web repository of free knowledge."
"Volunteer editors devote time and effort to creating and expanding articles in more than 300 language editions."
"To overcome this limitation, we propose a novel computational framework for modeling the quality of Wikipedia articles."