Core Concepts
NSina introduces a large news corpus for Sinhala, addressing challenges in adapting LLMs to low-resource languages.
Abstract
1. Introduction
Large language models (LLMs) have revolutionized natural language processing (NLP).
LLMs excel in high-resource languages but face challenges in low-resource languages like Sinhala.
NSina aims to provide a solution by offering a comprehensive news corpus and NLP tasks.
2. Dataset Construction
Data was collected from popular Sri Lankan news sources.
NSina consists of 506,932 news articles with varied token frequencies.
"Lankadeepa" and "Hiru News" contribute the most to the corpus.
3. Tasks
Three NLP tasks were created from NSina: news media identification, news category prediction, and news headline generation.
Models such as XLM-R Large and SinBERT were evaluated on each task.
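As a rough illustration of how one article could feed all three tasks, the sketch below pairs the article body with a different target per task. The `Article` record, its field names, and the `build_task_examples` helper are hypothetical and are not taken from the NSina release; the actual dataset schema and task construction may differ.

```python
from dataclasses import dataclass


@dataclass
class Article:
    # Hypothetical record layout; the real NSina fields may differ.
    source: str      # news outlet, e.g. "Lankadeepa"
    category: str    # editorial category label
    headline: str    # article headline
    body: str        # article text


def build_task_examples(article: Article) -> dict:
    """Map one article to (input, target) pairs for the three tasks."""
    return {
        # Classification: predict the outlet from the text.
        "media_identification": (article.body, article.source),
        # Classification: predict the category from the text.
        "category_prediction": (article.body, article.category),
        # Generation: produce the headline from the text.
        "headline_generation": (article.body, article.headline),
    }


if __name__ == "__main__":
    sample = Article(
        source="Hiru News",
        category="placeholder-category",
        headline="placeholder headline",
        body="placeholder article text",
    )
    for task, (text, target) in build_task_examples(sample).items():
        print(task, "->", target)
```

The first two pairs suit an encoder classifier of the kind evaluated in the paper, while the third needs a model that can generate text.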
4. Conclusion
NSina offers valuable resources for training LLMs in Sinhala.
Transformer models show promise but struggle in natural language generation tasks.
Stats
NSina is a comprehensive news corpus comprising more than 500,000 articles.
"Lankadeepa" and "Hiru News" contribute the most to the corpus.