Core Concepts
Large multilingual language models can effectively represent orthographic and lexical variation across Occitan dialects without the need for extensive data normalization.
Abstract
This study investigates the ability of a fine-tuned multilingual BERT (mBERT) model to represent the orthographic and lexical variation across four dialects of Occitan, a low-resource Western Romance language. The authors first compile a parallel lexicon covering four Occitan dialects (Lengadocian, Lemosin, Provençau, and Gascon) to enable controlled evaluation. They then fine-tune mBERT on a multi-dialect Occitan corpus and conduct a series of experiments:
Intrinsic Evaluation:
Analogy Computation: The fine-tuned model performs poorly on semantic analogies but better on syntactic analogies, suggesting limitations in representing semantic relations across dialects.
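The analogy evaluation follows the standard vector-arithmetic scheme: "a is to b as c is to ?" is answered by finding the word whose embedding is closest (by cosine similarity) to b − a + c. A minimal sketch with toy static vectors standing in for embeddings extracted from the fine-tuned model; all words and vectors below are illustrative assumptions, not the authors' data:

```python
from math import sqrt

# Toy word vectors; placeholders for mBERT-derived embeddings.
emb = {
    "rei":   [0.9, 0.1, 0.0],  # "king"
    "reina": [0.9, 0.1, 0.8],  # "queen"
    "ome":   [0.7, 0.0, 0.0],  # "man"
    "femna": [0.7, 0.0, 0.8],  # "woman"
    "ostal": [0.1, 0.9, 0.0],  # "house" (distractor)
}

def cosine(u, v):
    dot = sum(a * b for a, b in zip(u, v))
    return dot / (sqrt(sum(a * a for a in u)) * sqrt(sum(b * b for b in v)))

def solve_analogy(a, b, c):
    """a is to b as c is to ? -- nearest neighbour of b - a + c by cosine."""
    target = [eb - ea + ec for ea, eb, ec in zip(emb[a], emb[b], emb[c])]
    candidates = [w for w in emb if w not in {a, b, c}]
    return max(candidates, key=lambda w: cosine(emb[w], target))

print(solve_analogy("rei", "reina", "ome"))  # "femna"
```

An analogy counts as correct only when the top-ranked candidate matches the expected word, which is why weak semantic structure in the embeddings shows up directly as low accuracy.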
Lengadocian Lexicon Induction: The fine-tuned model induces Lengadocian equivalents from the other dialects more accurately when the paired words share high surface (spelling) similarity, indicating that surface similarity strongly shapes how the model aligns parallel lexical items.
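Surface similarity between parallel lexical items is naturally measured as a normalized edit-distance ratio; the paper's exact metric is not reproduced here, so the measure and the word pairs below are illustrative assumptions:

```python
def levenshtein(a: str, b: str) -> int:
    """Classic dynamic-programming edit distance (insert/delete/substitute)."""
    prev = list(range(len(b) + 1))
    for i, ca in enumerate(a, 1):
        curr = [i]
        for j, cb in enumerate(b, 1):
            curr.append(min(prev[j] + 1,            # deletion
                            curr[-1] + 1,           # insertion
                            prev[j - 1] + (ca != cb)))  # substitution
        prev = curr
    return prev[-1]

def surface_similarity(a: str, b: str) -> float:
    """1.0 = identical spelling, 0.0 = maximally different."""
    return 1 - levenshtein(a, b) / max(len(a), len(b))

# Hypothetical cross-dialect pairs (not from the authors' lexicon).
for w1, w2 in [("hemna", "femna"), ("arriu", "riu")]:
    print(w1, w2, round(surface_similarity(w1, w2), 2))
```

Pairs with a ratio near 1.0 are the ones the model retrieves reliably; low-ratio pairs are where induction accuracy drops.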
Extrinsic Evaluation:
Part-of-Speech Tagging: The fine-tuned model achieves high accuracy on PoS tagging, even when trained only on data from the Lengadocian dialect and tested on all four dialects.
Universal Dependency Parsing: The fine-tuned model's performance on UD parsing is robust to dialectal variation, though it struggles more with the Provençau dialect.
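The cross-dialect evaluation protocol above amounts to training on one dialect and scoring the same model separately on test data from each dialect. A minimal sketch of that scoring step; the tag sequences are made-up placeholders, not the paper's data:

```python
from collections import defaultdict

def per_dialect_accuracy(examples):
    """examples: list of (dialect, gold_tags, predicted_tags) triples."""
    correct, total = defaultdict(int), defaultdict(int)
    for dialect, gold, pred in examples:
        for g, p in zip(gold, pred):
            correct[dialect] += (g == p)
            total[dialect] += 1
    return {d: correct[d] / total[d] for d in total}

# Hypothetical predictions from a tagger trained only on Lengadocian data.
examples = [
    ("Lengadocian", ["DET", "NOUN", "VERB"], ["DET", "NOUN", "VERB"]),
    ("Gascon",      ["DET", "NOUN", "VERB"], ["DET", "NOUN", "ADJ"]),
]
print(per_dialect_accuracy(examples))
```

Breaking accuracy out per dialect is what makes robustness (or the weaker Provençau results) visible, rather than averaging the variation away.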
The results suggest that large multilingual language models can effectively represent orthographic and lexical variation across Occitan dialects without the need for extensive data normalization during pre-processing. However, the model still struggles to fully capture semantic relations between parallel lexical items with low surface similarity across dialects.
Stats
Occitan has six main dialects, with significant lexical and orthographic variation between them.
The parallel lexicon compiled for this study contains over 2,200 entries across four Occitan dialects.
The fine-tuning corpus contains 386,552 lines (10,941,124 tokens) of Occitan data from Wikipedia discussions and parallel corpora.
Quotes
"Effectively normalizing textual data poses a considerable challenge, especially for low-resource languages lacking standardized writing systems."
"Our findings suggest that large multilingual models minimize the need for spelling normalization during pre-processing."