The authors describe their submission to the SIGTYP 2024 Shared Task on Word Embedding Evaluation for Ancient and Historical Languages: a simple, uniform, and computationally lightweight approach based on parameter-efficient fine-tuning with the adapters framework.
The key highlights are:
The authors applied the same adapter-based approach uniformly to all tasks (morphological annotation, POS-tagging, lemmatization, character- and word-level gap-filling) and 16 languages by fine-tuning stacked language- and task-specific adapters on top of the XLM-RoBERTa model.
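A minimal sketch of this setup, assuming the HuggingFace `adapters` library (successor to adapter-transformers); the adapter names, the local adapter path, and the label count are illustrative placeholders rather than the authors' exact configuration:

```python
from adapters import AutoAdapterModel
from adapters.composition import Stack

# Base multilingual encoder; all names below are illustrative.
model = AutoAdapterModel.from_pretrained("xlm-roberta-base")

# Load a language adapter assumed to have been trained beforehand with
# masked language modeling on the target language (hypothetical path).
lang = model.load_adapter("./adapters/latin_mlm")

# Add a fresh task adapter plus a token-classification head, e.g. for POS-tagging.
model.add_adapter("pos")
model.add_tagging_head("pos", num_labels=17)

# Freeze the base model and the language adapter; only "pos" is trained.
model.train_adapter("pos")

# Each transformer layer now runs the language and task adapters stacked.
model.active_adapters = Stack(lang, "pos")
```

Keeping the base model frozen means each new language/task combination adds only a small number of trainable parameters, which is what makes the approach computationally lightweight.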
For languages with scripts underrepresented in the XLM-RoBERTa vocabulary, the authors trained custom tokenizers and initialized the embedding layers using lexical overlap with the original multilingual embeddings.
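The sketch below shows one way such overlap-based initialization can be done with HuggingFace `transformers`; the corpus iterator `corpus_lines`, the vocabulary size, and the statistics-matched random fallback are assumptions, not the authors' exact recipe:

```python
import torch
from transformers import AutoTokenizer, AutoModelForMaskedLM

model = AutoModelForMaskedLM.from_pretrained("xlm-roberta-base")
old_tok = AutoTokenizer.from_pretrained("xlm-roberta-base")

# Train a new tokenizer of the same type on the target-language corpus
# (corpus_lines is a hypothetical iterator over raw text lines).
new_tok = old_tok.train_new_from_iterator(corpus_lines, vocab_size=8000)

old_emb = model.get_input_embeddings().weight.detach()
hidden = old_emb.size(1)

# Start from random vectors matching the original embedding statistics...
new_emb = torch.normal(
    old_emb.mean().item(), old_emb.std().item(), size=(len(new_tok), hidden)
)

# ...then copy the original vector for every token the two vocabularies share.
old_vocab = old_tok.get_vocab()
for token, new_id in new_tok.get_vocab().items():
    if token in old_vocab:
        new_emb[new_id] = old_emb[old_vocab[token]]

model.resize_token_embeddings(len(new_tok))
model.get_input_embeddings().weight.data.copy_(new_emb)
```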
The authors' submission placed second overall out of three submissions and took first place in the word-level gap-filling task.
The results demonstrate the feasibility of adapting language models pre-trained on modern languages to historical and ancient languages via adapter training, despite the limited data available for these low-resource languages.
The authors note that the performance is lower for languages requiring custom tokenizers and embeddings, suggesting that more sophisticated approaches to embedding initialization and tokenizer training could further improve the results.
The authors also discuss the strengths and limitations of their algorithmic approach to character-level gap-filling, which performed well for all languages except Classical Chinese.
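The summary does not spell the algorithm out; as one plausible illustration (not necessarily the authors' method), a purely algorithmic character gap-filler can rank candidates by how often the gap's flanking contexts co-occur in the training corpus:

```python
from collections import Counter

def fill_char_gap(left, right, corpus, n=3):
    """Rank candidate characters for a one-character gap by counting how
    often the flanking n-gram contexts co-occur in the training corpus.
    Purely illustrative; the paper's actual algorithm may differ."""
    l, r = left[-n:], right[:n]
    counts = Counter()
    for line in corpus:
        start = line.find(l)
        while start != -1:
            j = start + len(l)  # position of the gap character
            if j < len(line) and line[j + 1 : j + 1 + len(r)] == r:
                counts[line[j]] += 1
            start = line.find(l, start + 1)
    return counts.most_common()

# e.g. fill_char_gap("qu", "ck", ["the quick brown fox"]) -> [('i', 1)]
```

A context-matching heuristic like this naturally degrades when the candidate inventory is very large, which is consistent with the reported difficulty on Classical Chinese.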
Key insights distilled from the source by Aleksei Dork... on arxiv.org, 04-22-2024:
https://arxiv.org/pdf/2404.12845.pdf