
Comprehensive Indic Language News Classification Datasets: L3Cube-IndicNews


Core Concept
L3Cube-IndicNews is a comprehensive dataset for text classification in 11 prominent Indic languages, including short headlines, long documents, and long paragraphs, with consistent labeling across diverse news categories.
Summary
L3Cube-IndicNews is a multilingual text classification resource that curates high-quality news headlines and articles for 11 Indian regional languages. It consists of three distinct sub-datasets:

- Short Headlines Classification (SHC): news headlines paired with their categorical labels.
- Long Paragraph Classification (LPC): news article sub-sections with their categorical labels.
- Long Document Classification (LDC): full news articles with their categorical labels.

Each language dataset has a minimum of 26,000 rows and 10-12 news categories. The data was collected from reputable news sources and carefully preprocessed to ensure quality and consistency. The authors evaluated the datasets with several models, including L3Cube monolingual BERT, L3Cube monolingual Indic Sentence BERT (IndicSBERT), and multilingual IndicBERT. The results show that the monolingual BERT models generally outperform the multilingual models, with the LDC dataset exhibiting the highest accuracy across the board. The L3Cube-IndicNews dataset and the corresponding models are publicly available, significantly expanding text classification resources for Indic languages and enabling the development of robust topic classification models.
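The defining property of the three sub-datasets is that they share one label set per language, which is what makes the length-based comparisons possible. A minimal sketch of that invariant (the rows and category names below are illustrative toy data, not actual L3Cube-IndicNews content):

```python
# Toy model of the three parallel sub-datasets (SHC/LPC/LDC) sharing one
# label set. Real L3Cube-IndicNews data is in Indic scripts with 10-12
# categories per language; these rows are illustrative only.

SHARED_LABELS = {"sports", "politics", "entertainment"}

# Each sub-dataset is a list of (text, label) pairs: SHC holds headlines,
# LPC holds article sub-sections, LDC holds full articles.
shc = [("Team wins the final", "sports")]
lpc = [("The match went into extra time before the winning goal ...", "sports")]
ldc = [("Full text of the match report, covering both halves ...", "sports")]

def labels_consistent(*datasets, allowed=SHARED_LABELS):
    """Return True if every sub-dataset uses only labels from the shared
    set, which is the consistency the paper maintains for length-based
    analysis across SHC, LPC, and LDC."""
    return all(label in allowed for ds in datasets for _, label in ds)

print(labels_consistent(shc, lpc, ldc))  # True: all three use the shared labels
```

A check like this is what allows a model trained on one document length to be evaluated on another without any label remapping.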
Key Findings
The LDC dataset achieved the highest accuracy across all models, indicating that longer documents provide more informative content for classification. The SHC dataset exhibited relatively lower accuracy scores, suggesting that news headlines can sometimes be more generalized, leading to increased model confusion. The L3Cube monolingual BERT models outperformed the multilingual models in most cases, highlighting the importance of language-specific model fine-tuning.
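The comparison behind these findings is a straightforward per-dataset accuracy computation. A self-contained sketch (the gold labels and predictions are toy values chosen to mirror the reported trend, not the paper's results):

```python
# Sketch of the per-dataset accuracy comparison described above: full
# articles (LDC) give the classifier the most context, headlines (SHC)
# the least. All values here are illustrative, not from the paper.

def accuracy(y_true, y_pred):
    """Fraction of predictions matching the gold labels."""
    assert len(y_true) == len(y_pred)
    return sum(t == p for t, p in zip(y_true, y_pred)) / len(y_true)

gold = ["sports", "politics", "sports", "entertainment"]

preds_by_dataset = {
    "SHC": ["sports", "sports", "politics", "entertainment"],  # short headlines: more confusion
    "LPC": ["sports", "politics", "politics", "entertainment"],
    "LDC": ["sports", "politics", "sports", "entertainment"],  # full articles: most context
}

for name, preds in preds_by_dataset.items():
    print(f"{name}: {accuracy(gold, preds):.2f}")
```

In the paper's actual experiments this comparison is run per language and per model, but the scoring step reduces to the same label-match accuracy shown here.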
Quotes
"L3Cube-IndicNews offers 3 distinct datasets tailored to handle different document lengths that are classified as: Short Headlines Classification (SHC) dataset containing the news headline and news category, Long Document Classification (LDC) dataset containing the whole news article and the news category, and Long Paragraph Classification (LPC) containing sub-articles of the news and the news category."

"We maintain consistent labeling across all 3 datasets for in-depth length-based analysis."

Key insights distilled from

by Aishwarya Mi... at arxiv.org, 04-30-2024

https://arxiv.org/pdf/2401.02254.pdf
L3Cube-IndicNews: News-based Short Text and Long Document Classification Datasets in Indic Languages

Deeper Inquiries

How can the L3Cube-IndicNews dataset be extended to include more diverse news categories or languages to further enhance its utility?

To extend the utility of the L3Cube-IndicNews dataset, several strategies can be implemented. Firstly, incorporating more diverse news categories can enhance the dataset's coverage and applicability. This can be achieved by scraping data from additional reputable news sources that cover a wider range of topics such as technology, environment, culture, and more. Moreover, including lesser-known regional languages and dialects can further diversify the dataset, making it more representative of India's linguistic landscape. Collaborating with linguistic experts and journalists fluent in these languages can ensure the accuracy and relevance of the added categories and languages. Additionally, implementing a crowdsourcing approach to gather labeled data from native speakers can help in expanding the dataset efficiently while maintaining quality standards.

What are the potential challenges in developing cross-lingual models that can effectively leverage the high overlap of labels across the L3Cube-IndicNews datasets?

Developing cross-lingual models that effectively leverage the high overlap of labels across the L3Cube-IndicNews datasets can pose several challenges. One key challenge is the linguistic diversity and complexity of Indic languages, which may require specialized preprocessing techniques to handle different writing scripts, grammar rules, and vocabulary. Ensuring the models can accurately capture the nuances and context-specific features of each language is crucial for cross-lingual performance. Another challenge is the imbalance in data distribution across languages and categories, which can lead to biased models. Addressing this imbalance through data augmentation, sampling techniques, or fine-tuning strategies is essential to improve model robustness. Furthermore, aligning the embeddings and representations of different languages while maintaining language-specific characteristics is a complex task that requires careful optimization and tuning.

How can the insights gained from the performance differences between the monolingual and multilingual models be applied to improve the overall text classification capabilities for Indic languages?

Insights gained from the performance differences between monolingual and multilingual models can be leveraged to enhance text classification capabilities for Indic languages. Firstly, understanding the strengths and weaknesses of each model type can guide the selection of the most suitable approach based on the specific task requirements. For instance, if the focus is on individual language proficiency, monolingual models may be preferred for higher accuracy. On the other hand, if the goal is to handle multiple languages efficiently, multilingual models can offer broader coverage. Additionally, fine-tuning models on mixed datasets comprising different document lengths (SHC, LPC, LDC) can improve the models' adaptability to varying text structures. Leveraging transfer learning techniques to transfer knowledge from high-resource languages to low-resource languages can also enhance the performance of models across all languages in the dataset. Regular evaluation and benchmarking of models on diverse datasets can provide valuable insights for continuous improvement and optimization.