
Sāmayik: A Contemporary English-Sanskrit Parallel Dataset for Machine Translation


Core Concepts
Sāmayik is a novel dataset of around 53,000 parallel English-Sanskrit sentences, focused on contemporary prose writing, to address the lack of digitized content in Sanskrit.
Abstract
Sāmayik is a novel dataset that aims to address the lack of digitized content in Sanskrit, which is still considered a low-resource language. Unlike existing datasets, which predominantly focus on classical poetry, Sāmayik covers contemporary prose writing across diverse domains such as language instruction material, textual teaching pedagogy, and online tutorials. The dataset is curated from five sources: (1) the New Testament of the Bible, whose Sanskrit version was originally published in 1851; (2) Mann ki Baat (MKB), an ongoing monthly radio podcast hosted by the Prime Minister of India, with expert Sanskrit translations; (3) Gītā Sopānaṁ, a book published in 2009 for teaching Sanskrit to beginners, with in-house translations into English; (4) Spoken Tutorials, a corpus of video tutorials with both English and Sanskrit transcripts; and (5) NIOS (National Institute of Open Schooling) study materials, which include courses on the Indian knowledge tradition, with parallel English and Sanskrit content. The dataset is benchmarked using four pre-trained multilingual models: ByT5, mBART, IndicBART, and IndicTrans. Models trained on Sāmayik outperform those trained on existing datasets, such as Itihasa, when evaluated on out-of-domain contemporary content like MKB. This highlights the importance of a dataset like Sāmayik that focuses on contemporary usage of Sanskrit.
Stats
The New Testament of the Bible contains 7,838 parallel sentences.
Mann ki Baat (MKB) contains 4,047 parallel sentences with 47,843 words.
Gītā Sopānaṁ contains 6,130 parallel sentences with 6,465 unique words.
Spoken Tutorials contains 23,835 parallel sentences with 237,449 words.
NIOS contains 11,356 parallel sentences with 105,178 words and 30,966 unique words.
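The per-source sentence counts listed in this section can be summed to check the "around 53,000" figure. The short Python sketch below does only that arithmetic; the dictionary and function names are illustrative and are not part of the dataset or the paper.

```python
# Parallel-sentence counts per source, copied from the Stats section.
# The names SENTENCE_COUNTS and summarize are hypothetical, chosen for this sketch.
SENTENCE_COUNTS = {
    "New Testament": 7_838,
    "Mann ki Baat": 4_047,
    "Gītā Sopānaṁ": 6_130,
    "Spoken Tutorials": 23_835,
    "NIOS": 11_356,
}


def summarize(counts):
    """Return the total sentence count and each source's share of that total."""
    total = sum(counts.values())
    shares = {name: n / total for name, n in counts.items()}
    return total, shares


total, shares = summarize(SENTENCE_COUNTS)
print(f"Total parallel sentences: {total:,}")

# List sources from largest to smallest share of the corpus.
for name, share in sorted(shares.items(), key=lambda kv: kv[1], reverse=True):
    print(f"{name}: {share:.1%}")
```

The total comes to 53,206 sentences, with Spoken Tutorials as the largest single source.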
Quotes
"Sanskrit is estimated to have around 30 million extant manuscripts fit for digitization. Moreover, it has more than two million active speakers."
"Sāmayik is a Sanskrit term that translates to the 'sayings of the contemporary world'."

Key Insights Distilled From

by Ayush Mahesh... at arxiv.org 04-01-2024

https://arxiv.org/pdf/2305.14004.pdf

Deeper Inquiries

What are the potential applications of the Sāmayik dataset beyond machine translation, such as in language learning or computational linguistics research?

The Sāmayik dataset has various potential applications beyond machine translation. One significant application is in language learning. Language learners, especially those interested in Sanskrit, can benefit from this dataset by using it for practice, vocabulary building, and understanding the nuances of contemporary Sanskrit prose. The dataset can serve as a valuable resource for language educators to create teaching materials, exercises, and assessments. Additionally, researchers in computational linguistics can utilize the dataset for studying language patterns, syntax, and semantics in contemporary Sanskrit, contributing to the broader field of natural language processing.

How can the dataset be further expanded or improved to better represent the diversity of contemporary Sanskrit prose writing?

Several strategies could make the Sāmayik dataset more representative of contemporary Sanskrit prose. Adding sources from other genres, such as modern literature, news articles, scientific publications, and social media content, would give a more comprehensive view of contemporary usage. Covering a wider range of writing styles and regional variations would help capture the diversity of the language. Collaborating with native speakers, scholars, and subject matter experts to validate and enrich the dataset would help ensure its authenticity and relevance. Audio-visual content, dialogues, and conversational data would add depth and context, making the dataset more robust for a range of applications.

What are the implications of the findings regarding the importance of using contemporary datasets for machine translation, and how can this insight be applied to other low-resource language pairs?

The findings show that translation models should be trained on data that reflects how a language is currently used: models trained on Sāmayik outperform those trained on existing datasets such as Itihasa on contemporary content like MKB. Training on contemporary data can lead to more accurate and contextually relevant translations, especially for languages like Sanskrit that have a rich heritage and ongoing contemporary usage. To apply this insight to other low-resource language pairs, researchers and developers can curate datasets that capture modern-day usage, including slang, colloquialisms, and domain-specific terminology. By prioritizing contemporary datasets, machine translation models can better adapt to real-world language scenarios and improve their performance in diverse contexts.