Core Concepts
NSINa, a large news corpus for Sinhala language processing, is introduced to address the challenges of adapting LLMs to low-resource languages.
Statistics
The OSCAR 23.01 multilingual corpus contains only 2.6 GB of Sinhala text, which is less than 1% of the total dataset.
The previous Sinhala news corpus, SinMin, was only 1.01 GB, compared with 1.87 GB for NSINa.