Core Concept
IEPILE is a comprehensive bilingual (English and Chinese) information extraction instruction corpus containing approximately 0.32B tokens, constructed by collecting and cleaning 33 existing datasets and introducing a schema-based instruction generation strategy to address the limitations of existing IE datasets.
Summary
The paper introduces IEPILE, a large-scale bilingual (English and Chinese) information extraction (IE) instruction corpus. The authors address the limitations of existing IE datasets, which are often small in scale, fragmented, and lacking a standardized schema.
Key highlights:
Data Collection and Cleaning: The authors collected 26 English and 7 Chinese IE datasets, covering tasks such as Named Entity Recognition (NER), Relation Extraction (RE), and Event Extraction (EE). They employed standardization procedures to maintain data quality and format uniformity.
Schema-Based Instruction Generation: The authors introduce a schema-based instruction generation strategy to address two key issues: (1) Schema Query Disparity - inconsistency in the number of schema queries between training and evaluation, and (2) Semantic Confusion - co-occurrence of semantically similar schemas within instructions. They propose a hard negative schema construction method and a batched instruction generation approach to mitigate these issues.
Experimental Results: The authors fine-tune large language models (LLMs) such as Baichuan, LLaMA, and Qwen using IEPILE and demonstrate improved zero-shot performance compared to baselines, verifying the effectiveness of the dataset.
Analysis: The authors investigate the impact of inconsistent schema queries and the importance of the hard negative schema dictionary in enhancing model performance, especially for tasks with semantically similar schemas.
The IEPILE dataset and pre-trained models are open-sourced to support the NLP community.
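The schema-based instruction generation described in the highlights can be illustrated with a minimal Python sketch. It is not the authors' implementation: the function name build_instructions, the parameters batch_size and n_random_negatives, the instruction wording, and the example schema names are all hypothetical, chosen only to show the two ideas the summary names. For each text, the schemas that actually occur are joined with their semantically similar "hard negative" schemas plus a few random negatives, and the combined list is split into fixed-size batches so that each instruction queries a bounded number of schemas.

```python
import random


def build_instructions(text, positive_schemas, all_schemas, hard_negatives,
                       batch_size=4, n_random_negatives=2, seed=0):
    """Build batched schema queries for one text (illustrative sketch only).

    hard_negatives maps a schema to schemas that are semantically close to it;
    this dictionary and every parameter name here are assumptions.
    """
    rng = random.Random(seed)
    positives = list(dict.fromkeys(positive_schemas))  # de-duplicate, keep order

    # Hard negatives: similar schemas that do NOT occur in this text.
    hard = []
    for schema in positives:
        for candidate in hard_negatives.get(schema, []):
            if candidate not in positives and candidate not in hard:
                hard.append(candidate)

    # A few additional negatives sampled from the remaining schemas.
    remaining = [s for s in all_schemas if s not in positives and s not in hard]
    other = rng.sample(remaining, min(n_random_negatives, len(remaining)))

    queried = positives + hard + other
    rng.shuffle(queried)

    # Batched generation: each instruction asks about at most batch_size schemas.
    batches = [queried[i:i + batch_size] for i in range(0, len(queried), batch_size)]
    return [
        {
            "instruction": "Extract the listed schemas from the input text.",
            "schema": batch,
            "input": text,
        }
        for batch in batches
    ]


if __name__ == "__main__":
    # Hypothetical event-type schemas and similarity dictionary.
    schemas = ["layoffs", "resignation", "dismissal", "acquisition", "marriage", "birth"]
    similar = {"layoffs": ["resignation", "dismissal"]}
    for item in build_instructions("The company cut 200 jobs.", ["layoffs"], schemas, similar):
        print(item["schema"])
```

Keeping the batch size fixed during training and evaluation is one way to avoid the schema query disparity mentioned above, while placing a schema next to its look-alikes in the same pool is a way to expose the model to the semantic confusion case during training.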
Statistics
The IEPILE dataset contains approximately 0.32B tokens.
Quotes
"To this end, we introduce IEPILE, a comprehensive bilingual (English and Chinese) IE instruction corpus, which contains approximately 0.32B tokens."
"We introduce schema-based instruction generation to unearth a large-scale corpus."
"Experimental results on LLaMA, Baichuan and Qwen demonstrate that using IEPILE can enhance the performance of LLMs for IE, especially the zero-shot generalization."