Core Concepts
BABEL BRIEFINGS is a novel dataset of 4.7 million news headlines, collected from August 2020 to November 2021 in 30 languages across 54 locations worldwide and accompanied by English translations, designed to facilitate natural language processing and media studies.
Abstract
The BABEL BRIEFINGS dataset was collected in three steps:
Using the News API, the authors gathered available headlines once a day for each combination of 54 locations and 7 news categories, yielding about 20,000 instances per day, including headlines that appear more than once across locations and categories.
In a pre-processing step, duplicate occurrences of the same article were merged into a single entry, with each occurrence listed as an instance of that article, and author names were anonymized.
All non-English articles were translated to English using Google Translate for convenience.
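The merging in the pre-processing step can be illustrated with a minimal sketch. It assumes that two rows describe the same article when they share a URL, and the field names (url, title, location, category, collected_at) are hypothetical; the authors' actual matching rule and schema may differ.

```python
def merge_duplicates(raw_rows):
    """Merge rows that share a URL into one article with a list of instances."""
    merged = {}
    for row in raw_rows:
        key = row["url"]  # assumption: the URL identifies an article
        article = merged.setdefault(
            key, {"url": key, "title": row["title"], "instances": []}
        )
        # Each occurrence is kept as an instance of the merged article.
        article["instances"].append(
            {
                "location": row["location"],
                "category": row["category"],
                "collected_at": row["collected_at"],
            }
        )
    return list(merged.values())


# Made-up rows: the first article was returned for two locations.
raw_rows = [
    {"url": "https://example.com/a", "title": "Storm hits coast",
     "location": "us", "category": "general", "collected_at": "2020-08-10"},
    {"url": "https://example.com/a", "title": "Storm hits coast",
     "location": "ca", "category": "science", "collected_at": "2020-08-10"},
    {"url": "https://example.com/b", "title": "Markets rally",
     "location": "us", "category": "business", "collected_at": "2020-08-10"},
]

articles = merge_duplicates(raw_rows)
for article in articles:
    print(article["title"], len(article["instances"]))
```

Keeping every occurrence as an instance preserves where and when an article surfaced, which is what distinguishes the instance count from the distinct-article count.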
The dataset is structured as a collection of 54 JSON files, one per location, with each file containing a list of headlines represented as JSON objects with properties like title, description, content, URL, author, source, and language information.
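Reading one of the per-location files might look like the following sketch. The file name and the exact key names (title, description, content, url, author, source, language) are assumptions based on the property list above, and the sample records are invented so that the example runs without the real data.

```python
import json
import tempfile
from pathlib import Path

# Invented records that mimic the described properties (assumed key names).
sample = [
    {"title": "Tormenta golpea la costa", "description": None, "content": None,
     "url": "https://example.com/a", "author": None,
     "source": "Example News", "language": "es"},
    {"title": "Markets rally", "description": None, "content": None,
     "url": "https://example.com/b", "author": None,
     "source": "Example Wire", "language": "en"},
]


def load_headlines(path):
    """Load one location file: a JSON list of headline objects."""
    with open(path, encoding="utf-8") as f:
        return json.load(f)


with tempfile.TemporaryDirectory() as tmp:
    path = Path(tmp) / "location.json"  # hypothetical file name
    path.write_text(json.dumps(sample), encoding="utf-8")
    headlines = load_headlines(path)

# Select the headlines that would have needed translation.
non_english = [h["title"] for h in headlines if h["language"] != "en"]
print(non_english)
```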
In total, the dataset contains 7,419,089 instances of 4,719,199 distinct articles across 30 languages, with the most common being English, Spanish, French, Chinese, and German.
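As a quick check of what these totals imply, the short calculation below derives the average number of instances per distinct article directly from the two figures reported above. No other data is assumed.

```python
instances = 7_419_089
distinct_articles = 4_719_199

# Instances beyond the first occurrence of each article.
repeat_instances = instances - distinct_articles

# Average number of times an article appears across locations, categories, and days.
instances_per_article = instances / distinct_articles

print(repeat_instances)                 # 2699890
print(round(instances_per_article, 2))  # 1.57
```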
The authors demonstrate the dataset's potential for analyzing global news coverage by clustering articles about the same event and visualizing the resulting event signatures, that is, the distribution and frequency of articles from different countries over time. These signatures reveal qualitative differences between the coverage of "expected" events (with a clear lead-up and peak) and "unexpected" events (with a sudden spike and gradual decline).
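An event signature of this kind can be sketched as a count of instances per location per day. The example assumes the cluster of instances for one event has already been produced by some clustering method (not shown, and not necessarily the authors'), and the field names location and date are hypothetical.

```python
from collections import Counter, defaultdict


def event_signature(instances):
    """Count the instances of one event cluster per location and per day."""
    signature = defaultdict(Counter)
    for inst in instances:
        signature[inst["location"]][inst["date"]] += 1
    # Sort days so each location's series reads chronologically.
    return {loc: dict(sorted(days.items())) for loc, days in signature.items()}


# Made-up cluster: four instances of articles about the same event.
cluster = [
    {"location": "us", "date": "2021-01-06"},
    {"location": "us", "date": "2021-01-06"},
    {"location": "us", "date": "2021-01-07"},
    {"location": "de", "date": "2021-01-07"},
]

sig = event_signature(cluster)
for location, series in sig.items():
    print(location, series)
```

Plotting each location's series over time would give the kind of picture described above: a gradual build-up before the peak for expected events, or a sudden spike followed by a slow decline for unexpected ones.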
The BABEL BRIEFINGS dataset enables a wide range of natural language processing tasks, as well as more nuanced analyses of cultural biases and differences in news reporting across the world.
Stats
The dataset contains a total of 7,419,089 instances of 4,719,199 distinct articles.