CHisIEC: Ancient Chinese Historical Information Extraction Corpus
Key Concepts
CHisIEC is a meticulously curated dataset for developing and evaluating named entity recognition (NER) and relation extraction (RE) models, offering a resource to facilitate research on ancient Chinese history.
Summary
Contents:
Abstract
Introduction
Related Work
Information Extraction Challenges in Ancient Chinese History
CHisIEC Dataset Construction Process
Experimental Results for NER Task
Experimental Results for RE Task
Conclusion and Acknowledgments
Abstract:
NLP plays a crucial role in Digital Humanities.
The CHisIEC dataset supports NER and RE for ancient Chinese history.
The dataset spans 13 dynasties over 1,830 years and contains 14,194 entities and 8,609 relations.
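To make the notions of "entity" and "relation" concrete, the following is a minimal sketch of how one annotated sentence could be represented. The field names, the labels `PER`, `LOC` and `attack`, and the example sentence are all assumptions made for illustration; they are not taken from the CHisIEC release or its annotation scheme.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class Entity:
    start: int  # inclusive character offset into the sentence
    end: int    # exclusive character offset into the sentence
    label: str  # hypothetical entity type, e.g. "PER" or "LOC"


@dataclass(frozen=True)
class Relation:
    head: Entity
    tail: Entity
    label: str  # hypothetical relation type


def surface(text: str, entity: Entity) -> str:
    """Return the characters covered by an entity span."""
    return text[entity.start:entity.end]


# Illustrative sentence written for this sketch, not quoted from the corpus.
text = "曹操擊呂布於濮陽"
cao_cao = Entity(0, 2, "PER")
lu_bu = Entity(3, 5, "PER")
puyang = Entity(6, 8, "LOC")

entities = [cao_cao, lu_bu, puyang]
relations = [Relation(cao_cao, lu_bu, "attack")]

for rel in relations:
    print(surface(text, rel.head), rel.label, surface(text, rel.tail))
```

Under this representation, NER amounts to predicting the `Entity` spans and RE amounts to predicting the `Relation` label for a given pair of spans.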
Introduction:
Technology enhances the preservation of historical heritage.
Applied technologies include NLP, computer vision (CV), and knowledge graphs (KG).
Recent studies focus on structuring cultural heritage documents.
Related Work:
Prior studies annotate entities in a Korean cultural heritage corpus.
Other work annotates entities in historical newspapers in a European context.
Relation extraction datasets have been developed for ancient Chinese history.
Information Extraction Challenges in Ancient Chinese History:
The availability of labeled data is crucial for advancing information extraction (IE) models.
The extensive time span and linguistic heterogeneity of the sources pose unique challenges.
CHisIEC Dataset Construction Process:
Texts are selected from the Twenty-Four Histories, spanning 1,830 years.
Annotation involves multiple annotators followed by professional review.
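When several people annotate the same text, a common sanity check is to measure how often they agree beyond chance. The sketch below computes Cohen's kappa for two annotators; the summary does not say which agreement measure, if any, was used for CHisIEC, so this is only an illustration, and the label sequences are invented.

```python
from collections import Counter


def cohen_kappa(labels_a, labels_b):
    """Cohen's kappa for two annotators labelling the same items."""
    if len(labels_a) != len(labels_b) or not labels_a:
        raise ValueError("both annotators must label the same non-empty item list")
    n = len(labels_a)
    # Observed agreement: fraction of items given the same label.
    observed = sum(a == b for a, b in zip(labels_a, labels_b)) / n
    # Chance agreement: computed from each annotator's label frequencies.
    count_a = Counter(labels_a)
    count_b = Counter(labels_b)
    expected = sum(count_a[label] * count_b[label] for label in count_a) / (n * n)
    if expected == 1.0:
        # Both annotators used a single identical label throughout.
        return 1.0
    return (observed - expected) / (1.0 - expected)


# Invented per-token labels from two hypothetical annotators.
annotator_1 = ["PER", "PER", "LOC", "O", "O", "O"]
annotator_2 = ["PER", "LOC", "LOC", "O", "O", "O"]
kappa = cohen_kappa(annotator_1, annotator_2)
print(f"kappa = {kappa:.4f}")
```

Items on which annotators disagree are natural candidates for the professional review step.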
Experimental Results for NER Task:
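Pre-trained language models (PLMs) outperform large language models (LLMs) on NER, which is attributed to the PLMs' incremental training on Ancient Chinese.
Full fine-tuning is more effective than partial parameter updates such as LoRA or P-tuning.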
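Model comparisons on NER are usually reported as entity-level precision, recall and F1, where a prediction counts only if both its span and its type match the gold annotation exactly. The summary does not state the exact metric used in the paper, so the following is a sketch under that assumption, with invented spans and labels.

```python
def entity_prf1(gold, pred):
    """Exact-match precision, recall and F1 over (start, end, label) tuples."""
    gold_set, pred_set = set(gold), set(pred)
    true_positives = len(gold_set & pred_set)
    precision = true_positives / len(pred_set) if pred_set else 0.0
    recall = true_positives / len(gold_set) if gold_set else 0.0
    if precision + recall == 0.0:
        return precision, recall, 0.0
    f1 = 2 * precision * recall / (precision + recall)
    return precision, recall, f1


# Invented example: the second predicted entity has the right span
# but the wrong type, so it counts as an error.
gold = [(0, 2, "PER"), (3, 5, "PER"), (6, 8, "LOC")]
pred = [(0, 2, "PER"), (3, 5, "LOC"), (6, 8, "LOC")]

precision, recall, f1 = entity_prf1(gold, pred)
print(f"P={precision:.3f} R={recall:.3f} F1={f1:.3f}")
```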
Experimental Results for RE Task:
ChatGLM2 and Alpaca2 perform comparably to PLMs.
GPT-3.5 is hindered by the limited number of samples provided to it.
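The note on the number of samples can be illustrated by how a few-shot prompt for RE is assembled: the model only sees the handful of demonstrations that fit in the prompt. The sketch below builds such a prompt as a plain string. The prompt wording, the relation labels and the demonstration are assumptions for illustration and do not reproduce the prompt used in the paper; no model API is called.

```python
def build_re_prompt(demos, text, head, tail, labels):
    """Assemble a few-shot relation classification prompt as plain text."""
    lines = [
        "Classify the relation between the head and tail entity.",
        "Allowed labels: " + ", ".join(labels),
        "",
    ]
    # Each demonstration is one labelled example shown to the model.
    for demo in demos:
        lines += [
            f"Text: {demo['text']}",
            f"Head: {demo['head']}",
            f"Tail: {demo['tail']}",
            f"Relation: {demo['label']}",
            "",
        ]
    # The query repeats the same layout with the label left open.
    lines += [f"Text: {text}", f"Head: {head}", f"Tail: {tail}", "Relation:"]
    return "\n".join(lines)


# Hypothetical label set and demonstration, invented for this sketch.
labels = ["attack", "appointed_as", "born_in"]
demos = [
    {"text": "曹操擊呂布於濮陽", "head": "曹操", "tail": "呂布", "label": "attack"},
]
prompt = build_re_prompt(demos, "劉備攻張魯", "劉備", "張魯", labels)
print(prompt)
```

With only a few entries in `demos`, the model has little evidence for rarer relation types, which is one plausible reading of why a small sample budget limits performance.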
Conclusion and Acknowledgments:
The paper proposes the CHisIEC dataset for ancient Chinese history tasks.
Experiments validate the dataset's applicability and the capabilities of LLMs.