Core Concepts
The authors propose a method for assessing the linguistic diversity of multilingual NLP data sets by comparing them against a reference language sample. They use the Jaccard index to quantify this diversity and to identify linguistic features that a data set is missing.
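A minimal sketch of this kind of comparison, assuming each data set is reduced to the set of typological feature values covered by its languages; the feature names below are illustrative placeholders, not the paper's actual feature inventory:

```python
from typing import Set

def jaccard(a: Set[str], b: Set[str]) -> float:
    """Jaccard index |A ∩ B| / |A ∪ B| between two sets of feature values."""
    if not a and not b:
        return 1.0
    return len(a & b) / len(a | b)

# Hypothetical feature values (WALS-style "feature=value" pairs) covered by
# the languages of a reference sample vs. those of a candidate data set.
reference_features = {"word_order=SOV", "word_order=SVO", "word_order=VSO",
                      "case_marking=rich", "case_marking=none"}
dataset_features = {"word_order=SVO", "word_order=SOV", "case_marking=none"}

diversity = jaccard(dataset_features, reference_features)
missing = reference_features - dataset_features  # feature values absent from the data set
print(f"Jaccard diversity score: {diversity:.2f}")   # 0.60
print("Missing feature values:", sorted(missing))
```

A higher score means the data set covers more of the feature values attested in the reference sample; the set difference makes the missing features explicit.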
Abstract
The paper argues for measuring linguistic diversity in multilingual NLP data sets. It introduces a method that uses the Jaccard index to compare a data set against a reference sample and to highlight missing linguistic features. The study emphasizes the need for more nuanced morphological features and text-based descriptors to make such assessments accurate.
The authors apply the method to popular multilingual data sets such as UD, Bible 100, mBERT, XTREME, XGLUE, XNLI, XCOPA, TyDiQA, and XQuAD. They find discrepancies between the number of languages a data set includes and its actual linguistic diversity: adding more languages or language families does not guarantee high linguistic diversity.
Key points include evaluating syntactic and morphological diversity with text-based features such as mean word length (see the sketch below). The findings show that languages with rich morphology are underrepresented in existing data sets, and the study suggests prioritizing languages with long words to improve coverage in multilingual NLP.
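A small illustration of such a text-based descriptor, assuming mean word length is computed over whitespace/punctuation-delimited tokens; the sample texts are toy examples, not drawn from the TeDDi sample or any data set discussed here:

```python
import re

def mean_word_length(text: str) -> float:
    """Average length, in characters, of the tokens in a text sample.
    A crude proxy for morphological richness: languages that pack many
    morphemes into a word tend to have longer words on average."""
    tokens = re.findall(r"\w+", text)
    if not tokens:
        return 0.0
    return sum(len(t) for t in tokens) / len(tokens)

# Toy comparison: an analytic language vs. an agglutinative one.
samples = {
    "English": "the dog chased the cat across the yard",
    "Turkish": "evlerimizden geliyorlardı",  # long, morphologically complex words
}
for lang, text in samples.items():
    print(f"{lang}: mean word length = {mean_word_length(text):.2f}")
```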
Stats
Universal Dependencies (UD) contains 106 languages from 20 families.
Bible 100 dataset includes 103 languages from 30 families.
mBERT training data consists of 97 distinct languages from 15 families.
XTREME dataset comprises 40 languages from 14 families.
XGLUE dataset has 19 languages from 7 families.
TeDDi sample includes texts from 89 languages.
XCOPA dataset spans 11 different language families.
TyDiQA dataset covers 11 languages from various families.
XQuAD dataset involves 12 languages across 6 families.
Quotes
"The aim is to know how NLP technology generalizes across diverse languages."
"Our proposals intend to help researchers make informed choices when designing multilingual datasets."