Core Concepts
L3Cube-MahaNews is the largest supervised Marathi text classification dataset, containing over 1.05 lakh (roughly 105,000) records across 12 diverse categories, with separate sub-datasets for short headlines, paragraphs, and full-length documents.
Abstract
The L3Cube-MahaNews dataset is a comprehensive Marathi text classification corpus consisting of three sub-datasets:
Short Headlines Classification (SHC): This dataset contains news article headlines with their corresponding categorical labels.
Long Paragraph Classification (LPC): This dataset includes news article paragraphs with their categorical labels.
Long Document Classification (LDC): This dataset comprises full news articles with their categorical labels.
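The relationship between the three sub-datasets can be illustrated with a minimal sketch. It assumes that one labeled article yields one SHC record (its headline), one LDC record (its full text), and one LPC record per paragraph. The names Record and make_records are hypothetical, and splitting paragraphs on blank lines is an assumption. The authors may have selected only some paragraphs per article, so this is not their actual pipeline.

```python
from dataclasses import dataclass


@dataclass
class Record:
    task: str   # "SHC", "LPC", or "LDC"
    text: str
    label: str  # one of the 12 categories, e.g. "Sports"


def make_records(headline: str, body: str, label: str) -> list[Record]:
    """Turn one labeled article into SHC, LPC, and LDC records (hypothetical helper)."""
    # Assumption: paragraphs are separated by blank lines.
    paragraphs = [p.strip() for p in body.split("\n\n") if p.strip()]

    records = [Record("SHC", headline.strip(), label)]         # headline only
    records += [Record("LPC", p, label) for p in paragraphs]   # one per paragraph
    records.append(Record("LDC", "\n\n".join(paragraphs), label))  # full article
    return records


if __name__ == "__main__":
    # Placeholder English text stands in for a Marathi article.
    for rec in make_records("Team wins final", "First paragraph.\n\nSecond paragraph.", "Sports"):
        print(rec.task, "|", rec.label, "|", rec.text[:30])
```

All records derived from the same article share its label, which is why the three tasks differ only in input length.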
The dataset covers 12 diverse categories: Auto, Bhakti, Crime, Education, Fashion, Health, International, Manoranjan, Politics, Sports, Tech, and Travel. With over 1.05 lakh (roughly 105,000) records, it is the largest supervised Marathi text classification dataset available and is significantly larger than existing Marathi datasets.
The authors provide baseline results using state-of-the-art pre-trained BERT models, including MahaBERT, IndicBERT, and MuRIL. The monolingual MahaBERT model outperforms the multilingual models on all three sub-datasets. The results demonstrate the need for diverse, high-quality Marathi datasets to support the development of advanced NLP models for this low-resource language.
Stats
The L3Cube-MahaNews dataset contains a total of 108,643 records (1,08,643 in Indian digit grouping) derived from 27,525 news articles. The SHC and LDC datasets have 27,525 records each, while the LPC dataset has 53,593 records.
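As a quick consistency check on these figures, the short sketch below sums the per-dataset counts quoted above and reports each sub-dataset's share of the total. The variable names are illustrative. The LPC-per-article ratio assumes that the LPC records come from the same 27,525 articles.

```python
# Record counts as reported in the Stats section.
records = {"SHC": 27_525, "LPC": 53_593, "LDC": 27_525}
articles = 27_525

total = sum(records.values())
assert total == 108_643  # matches the reported overall total

for name, count in records.items():
    # Share of each sub-dataset in the full corpus.
    print(f"{name}: {count:,} records ({count / total:.1%} of total)")

# Assumption: LPC paragraphs are drawn from the same set of articles.
print(f"LPC records per article: {records['LPC'] / articles:.2f}")
```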
The average word count per record is 12 for SHC, 150 for LPC, and 350 for LDC.
Quotes
"The availability of text or topic classification datasets in the low-resource Marathi language is limited, typically consisting of fewer than 4 target labels, with some achieving nearly perfect accuracy."
"The monolingual MahaBERT model outperforms all others on every dataset."