Core Concepts
UNER aims to provide high-quality, cross-lingually consistent annotations for multilingual NER research.
Summary
Abstract:
- UNER introduces an open, community-driven project for gold-standard NER benchmarks in multiple languages.
- UNER v1 includes 19 datasets with named entities across 13 languages.
Introduction:
- High-quality data in many languages is crucial for multilingual NLP.
- Existing human-annotated NER datasets are limited, leading to the proposal of UNER.
Dataset Design Principles:
- UNER focuses on three entity types: Person (PER), Organization (ORG), and Location (LOC).
- The annotation schema, inspired by Universal Dependencies, is designed to apply consistently across languages.
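
The three-type tag set above can be illustrated with a minimal sketch. It assumes a BIO (IOB2) token-level encoding, which this summary does not spell out, and the helper name `extract_spans` and the example sentence are hypothetical.

```python
from typing import List, Tuple

# The three entity types named in the design principles.
ENTITY_TYPES = ("PER", "ORG", "LOC")

# Assumption: tags follow a BIO scheme (B- opens an entity, I- continues it).
LABELS = ["O"] + [f"{prefix}-{etype}" for etype in ENTITY_TYPES for prefix in ("B", "I")]


def extract_spans(tags: List[str]) -> List[Tuple[int, int, str]]:
    """Return (start, end_exclusive, type) spans from a BIO tag sequence.

    An I- tag that does not continue an open entity of the same type is ignored.
    """
    spans = []
    start, etype = None, None
    for i, tag in enumerate(list(tags) + ["O"]):  # sentinel closes a trailing span
        continues = tag.startswith("I-") and etype == tag[2:]
        if start is not None and not continues:
            spans.append((start, i, etype))
            start, etype = None, None
        if tag.startswith("B-"):
            start, etype = i, tag[2:]
    return spans


# Hypothetical example sentence, not taken from the UNER data.
tokens = ["Barack", "Obama", "visited", "Paris"]
tags = ["B-PER", "I-PER", "O", "B-LOC"]
for start, end, etype in extract_spans(tags):
    print(etype, " ".join(tokens[start:end]))
```

Keeping the label inventory this small is what makes a single span extractor reusable across every language in the collection.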
Dataset Annotation Process:
- Data is sourced from Universal Dependencies corpora.
- Annotators recruited from the multilingual NLP community via social media.
- Annotations are collected with the TALEN tool, and secondary annotators label a portion of the data to measure inter-annotator agreement.
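
As a rough illustration of how agreement between a primary and a secondary annotator might be quantified, the sketch below computes token-level Cohen's kappa and counts which tag pairs disagree. This is an assumption for illustration only: the summary does not state which agreement metric UNER reports, and the function names and example tags are hypothetical.

```python
from collections import Counter
from typing import Sequence


def cohen_kappa(a: Sequence[str], b: Sequence[str]) -> float:
    """Token-level Cohen's kappa between two annotators' tag sequences."""
    if len(a) != len(b) or len(a) == 0:
        raise ValueError("sequences must be non-empty and of equal length")
    n = len(a)
    # Observed agreement: fraction of tokens given the same tag.
    observed = sum(x == y for x, y in zip(a, b)) / n
    # Chance agreement from each annotator's own tag distribution.
    count_a, count_b = Counter(a), Counter(b)
    expected = sum(count_a[label] * count_b[label] for label in count_a) / (n * n)
    if expected == 1.0:  # both annotators used one identical tag everywhere
        return 1.0
    return (observed - expected) / (1.0 - expected)


def disagreements(a: Sequence[str], b: Sequence[str]) -> Counter:
    """Count (primary_tag, secondary_tag) pairs where the annotators differ."""
    return Counter((x, y) for x, y in zip(a, b) if x != y)


# Hypothetical annotations of the same six tokens.
primary = ["B-PER", "I-PER", "O", "B-ORG", "O", "B-LOC"]
secondary = ["B-PER", "I-PER", "O", "B-LOC", "O", "B-LOC"]
print(round(cohen_kappa(primary, secondary), 3))
print(disagreements(primary, secondary).most_common())
```

The disagreement counter is the part that surfaces systematic confusions between specific tag pairs, such as an organization labeled as a location.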
Universal NER: Statistics and Analysis:
- Overview of the UNER datasets, which cover 13 languages and diverse domains.
- Inter-annotator agreement analysis reveals disagreements between ORG and LOC tags.
- Cross-lingual agreement analysis shows that entity counts and identities vary between languages.
Baselines for UNER:
- An XLM-R model fine-tuned under various training configurations shows promising results.
Related Work:
- Covers other efforts to add an NER layer to UD, multilingual NER resources, and modeling techniques.
Conclusion:
- UNER provides standardized evaluations for multilingual NER research.
Statistics
Building on the Universal Dependencies (UD) project, the Universal NER project introduces a data initiative covering 13 languages.