Core Concepts
The authors present a parameter-efficient fine-tuning approach, built on the adapters framework, for adapting the XLM-RoBERTa language model to a range of natural language processing tasks across 16 ancient and historical languages.
Abstract
The authors describe their submission to the SIGTYP 2024 Shared Task on Word Embedding Evaluation for Ancient and Historical Languages. They developed a simple, uniform, and computationally lightweight approach to parameter-efficient fine-tuning based on the adapters framework.
The key highlights are:
The authors applied the same adapter-based approach uniformly to all tasks (morphological annotation, POS-tagging, lemmatization, character- and word-level gap-filling) and 16 languages by fine-tuning stacked language- and task-specific adapters on top of the XLM-RoBERTa model.
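The stacked-adapter idea can be illustrated with a minimal sketch. The snippet below is not the authors' implementation (which uses XLM-RoBERTa and the adapters library); it is a toy NumPy version of a residual bottleneck adapter, with two adapters composed in sequence the way a language adapter and a task adapter are stacked on a frozen transformer layer. All dimensions and names here are illustrative.

```python
import numpy as np

def make_adapter(hidden_dim, bottleneck_dim, rng):
    """Create a randomly initialized bottleneck adapter (down- and up-projection)."""
    w_down = rng.normal(0.0, 0.02, size=(hidden_dim, bottleneck_dim))
    w_up = rng.normal(0.0, 0.02, size=(bottleneck_dim, hidden_dim))
    return w_down, w_up

def apply_adapter(h, adapter):
    """Residual bottleneck adapter: h + ReLU(h @ W_down) @ W_up."""
    w_down, w_up = adapter
    return h + np.maximum(h @ w_down, 0.0) @ w_up

rng = np.random.default_rng(0)
hidden = rng.normal(size=(5, 768))          # e.g. 5 token states from a frozen encoder layer
lang_adapter = make_adapter(768, 48, rng)   # language-specific adapter (trained first)
task_adapter = make_adapter(768, 48, rng)   # task-specific adapter (trained on top)

# Stacked composition: the language adapter runs first, then the task adapter.
out = apply_adapter(apply_adapter(hidden, lang_adapter), task_adapter)
```

The key property is that only the small projection matrices are trained per language and per task, while the pre-trained model weights stay frozen, which is what makes the approach computationally lightweight.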
For languages with scripts underrepresented in the XLM-RoBERTa vocabulary, the authors trained custom tokenizers and initialized the embedding layers using lexical overlap with the original multilingual embeddings.
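The paper does not spell out the exact initialization procedure, but the overlap-based idea can be sketched as follows: tokens that the custom tokenizer shares with the original vocabulary copy their pre-trained vectors, and the remaining rows are sampled to match the statistics of the original embedding matrix. The vocabularies and dimensions below are toy stand-ins, not the actual XLM-RoBERTa vocabulary.

```python
import numpy as np

def init_embeddings(new_vocab, old_vocab, old_emb, rng):
    """Initialize an embedding matrix for a custom tokenizer's vocabulary.

    Tokens shared with the original vocabulary copy their pre-trained
    vectors; all other rows are sampled from a normal distribution whose
    mean and std match the original embedding matrix.
    """
    dim = old_emb.shape[1]
    new_emb = rng.normal(old_emb.mean(), old_emb.std(),
                         size=(len(new_vocab), dim))
    copied = 0
    for token, new_id in new_vocab.items():
        if token in old_vocab:
            new_emb[new_id] = old_emb[old_vocab[token]]
            copied += 1
    return new_emb, copied

rng = np.random.default_rng(0)
# Toy stand-ins for the original vocabulary/embeddings and a custom tokenizer.
old_vocab = {"a": 0, "b": 1, "c": 2}
old_emb = rng.normal(size=(3, 4))
new_vocab = {"a": 0, "x": 1, "c": 2}   # "a" and "c" overlap; "x" is new

new_emb, copied = init_embeddings(new_vocab, old_vocab, old_emb, rng)
```

Copying overlapping vectors preserves whatever the pre-trained model already knows about shared tokens (punctuation, digits, borrowed words), which matters most for scripts underrepresented in the original vocabulary.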
The authors' submission obtained an overall second place out of three submissions, with the first place in the word-level gap-filling task.
The results demonstrate the feasibility of adapting language models pre-trained on modern languages to ancient and historical languages via adapter training, despite the limited data available for these low-resource languages.
The authors note that the performance is lower for languages requiring custom tokenizers and embeddings, suggesting that more sophisticated approaches to embedding initialization and tokenizer training could further improve the results.
The authors also discuss the strengths and limitations of their algorithmic approach to character-level gap-filling, which performed well on all languages except Classical Chinese.
Stats
The dataset provided by the organizers comprises 16 ancient and historical languages spanning several historical epochs, with all texts dated no later than 1700 CE.
Quotes
"The application of natural language processing techniques and pre-trained language models to analysis of ancient and historical languages is a compelling subject of research that has been so far overlooked."
"Large pre-trained language models, however, are predominantly trained on corpora of modern languages, with few exceptions such as Latin-BERT."
"Our submission obtained an overall second place out of three submissions, with the first place in word-level gap-filling."