Bibliographic Information: Sindane, T., & Marivate, V. (2024). From N-grams to Pre-trained Multilingual Models For Language Identification. arXiv preprint arXiv:2410.08728v1.
Research Objective: This paper investigates the effectiveness of N-gram models and pre-trained multilingual models for language identification (LID) across 11 South African languages, focusing on the challenges posed by low-resource languages.
Methodology: The researchers used the Vuk'uzenzele and National Centre for Human Language Technology (NCHLT) corpora, comparing the performance of character-based N-gram models (bi-gram, tri-gram, quad-gram), traditional machine learning classifiers (Naive Bayes, Support Vector Machines, Logistic Regression), and pre-trained multilingual models (mBERT, XLM-R, RemBERT, AfriBERTa, Afro-XLM-R, AfroLM, Serengeti). They also evaluated publicly available LID tools such as CLD3, AfroLID, GlotLID, and OpenLID; a sketch of the N-gram baseline follows.
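To make the N-gram baseline concrete, here is a minimal sketch (not the authors' code) of a character n-gram LID pipeline in scikit-learn, combining the bi- to quad-gram features and classical classifiers named above. The example texts, labels, and the combined (2, 4) n-gram range are illustrative assumptions.

```python
# A minimal, assumed sketch of a character N-gram LID baseline;
# not the authors' implementation.
from sklearn.feature_extraction.text import TfidfVectorizer
from sklearn.linear_model import LogisticRegression
from sklearn.naive_bayes import MultinomialNB
from sklearn.pipeline import Pipeline
from sklearn.svm import LinearSVC

def build_lid_pipeline(classifier: str = "nb") -> Pipeline:
    """Character n-gram LID pipeline over bi- to quad-grams."""
    clf = {
        "nb": MultinomialNB(),
        "svm": LinearSVC(),
        "lr": LogisticRegression(max_iter=1000),
    }[classifier]
    return Pipeline([
        # analyzer="char_wb" builds character n-grams within word
        # boundaries, which is robust for short, noisy inputs.
        ("ngrams", TfidfVectorizer(analyzer="char_wb", ngram_range=(2, 4))),
        ("clf", clf),
    ])

# Hypothetical usage: sentences labelled with ISO 639-3 codes for the
# 11 South African languages (e.g. zul, xho, afr, nso, ...).
texts = ["Sawubona, unjani namhlanje?", "Goeie môre, hoe gaan dit?"]
labels = ["zul", "afr"]
model = build_lid_pipeline("nb").fit(texts, labels)
print(model.predict(["Molo, unjani?"]))
```

In practice one would fit a separate model per n-gram range (bi-, tri-, quad-gram) to reproduce the paper's comparison, rather than the combined range used here for brevity.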
Key Findings:
Main Conclusions: The study concludes that pre-trained multilingual models, particularly those pre-trained on South African languages, offer significant advantages for LID in low-resource settings. The authors suggest that future research should explore the use of precision as an evaluation metric and investigate resource-efficient alternatives like parameter transfer and adaptation.
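Since the authors suggest precision as an evaluation metric, the following small sketch (an assumption for illustration, not taken from the paper) shows how per-language and macro-averaged precision can be computed with scikit-learn; the labels are hypothetical.

```python
from sklearn.metrics import classification_report, precision_score

# Hypothetical gold and predicted language labels.
y_true = ["zul", "zul", "xho", "ssw", "nbl", "afr"]
y_pred = ["zul", "xho", "xho", "zul", "nbl", "afr"]

# Per-language precision exposes which languages attract spurious
# predictions, e.g. among closely related Nguni languages.
print(classification_report(y_true, y_pred, zero_division=0))

# Macro-averaged precision weights each language equally, which matters
# when the evaluation data is imbalanced across the 11 languages.
print(precision_score(y_true, y_pred, average="macro", zero_division=0))
```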
Significance: This research contributes valuable insights into the application of LID techniques for South African languages, which is crucial for the development of language technologies and resources for these under-resourced languages.
Limitations and Future Research: The study acknowledges that it did not explore word embeddings or deep neural networks, nor the impact of the parameter 'k' in N-gram models. Future research could address these gaps, investigate model performance on human-generated text, and examine the potential of smaller, resource-efficient models.