Key Concepts
The INSTRUCTIE dataset provides a comprehensive bilingual (Chinese and English) resource for training large language models to perform instruction-based information extraction tasks across diverse domains.
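In instruction-based IE, a model receives a natural-language instruction naming the target relation schema together with an input text, and must emit structured (head, relation, tail) triples. A minimal illustrative prompt/response pair follows; the template wording, relation names, and example text are assumptions for illustration, not the dataset's actual format.

```python
# Hypothetical instruction-based IE example; the real INSTRUCTIE
# prompt template and relation inventory may differ.
instruction = (
    "From the text below, extract all (head, relation, tail) triples "
    "for the relations: place of birth, employer."
)
text = "Grace Hopper was born in New York City and worked for the US Navy."

# The structured output the model is expected to produce.
expected_output = [
    ("Grace Hopper", "place of birth", "New York City"),
    ("Grace Hopper", "employer", "US Navy"),
]

prompt = f"{instruction}\n\nText: {text}"
print(prompt)
```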
Abstract
The INSTRUCTIE dataset is introduced to address the limitations of existing information extraction (IE) datasets, which often suffer from limited coverage and high construction costs. The dataset is constructed with the KG2Instruction framework, which automatically generates relational triples by aligning knowledge graphs with existing corpora, supplements missing triples with a trained IE model, and filters out incorrect triples using natural language inference.
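The three KG2Instruction stages (align, supplement, filter) can be sketched as a simple pipeline. This is a toy sketch under stated assumptions: the function names, stub models, and substring-based alignment are illustrative, not the authors' implementation.

```python
# Illustrative sketch of the KG2Instruction stages; all names and the
# toy models below are assumptions, not the paper's actual code.

def align_kg_with_corpus(sentence, kg_triples):
    """Stage 1: keep KG triples whose head and tail both appear in the text."""
    return [(h, r, t) for (h, r, t) in kg_triples
            if h in sentence and t in sentence]

def supplement_with_ie_model(sentence, triples, ie_model):
    """Stage 2: add triples predicted by a trained IE model but missing so far."""
    predicted = ie_model(sentence)
    return triples + [tr for tr in predicted if tr not in triples]

def filter_with_nli(sentence, triples, entails):
    """Stage 3: drop triples the sentence does not entail (NLI filtering)."""
    return [(h, r, t) for (h, r, t) in triples
            if entails(sentence, f"{h} {r} {t}")]

# Toy run with stub models:
sentence = "Marie Curie was born in Warsaw."
kg = [("Marie Curie", "place of birth", "Warsaw"),
      ("Marie Curie", "field of work", "physics")]

aligned = align_kg_with_corpus(sentence, kg)  # drops the "physics" triple
full = supplement_with_ie_model(sentence, aligned, lambda s: [])
kept = filter_with_nli(sentence, full,
                       lambda s, hyp: hyp.split()[0] in s)  # stub entailment
print(kept)
```

In practice the stub `ie_model` and `entails` callables would be a trained IE model and an NLI classifier; the pipeline shape stays the same.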
The INSTRUCTIE dataset covers 12 diverse domains and 123 types of relations, containing 174,670 Chinese instances and 189,406 English instances. The authors evaluate various large language models (e.g., Baichuan2, LLaMA2, mT5) on INSTRUCTIE under multiple settings, including zero-shot learning, in-context learning, and fine-tuning. The results demonstrate that large language models fine-tuned on INSTRUCTIE perform better at instruction-based IE tasks and generalize more readily to other domains.
The authors also conduct an in-depth analysis of the dataset, including ablation studies on the KG2Instruction framework, evaluation of the models' generalization to unseen schemas, and error analysis. The findings suggest that the incorporation of LLMs and natural language inference models in the data generation process significantly improves the quality of the dataset, and the instruction-tuned models exhibit improved performance in entity recognition and relation extraction tasks.
Statistics
The INSTRUCTIE dataset covers 12 diverse domains and 123 types of relations.
The dataset contains 174,670 Chinese instances and 189,406 English instances.
The average number of triples per instance ranges from 1.78 to 11.96, depending on the domain.
The average token count per instance ranges from 80.67 to 267.37, depending on the domain.
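The per-domain averages above can be reproduced from raw instances with a simple aggregation. A minimal sketch follows, assuming each instance is a dict with "domain", "text", and "triples" fields and whitespace tokenization; this record shape and tokenizer are assumptions, not the dataset's actual schema.

```python
from collections import defaultdict

# Assumed instance shape; the real INSTRUCTIE schema and tokenizer may differ.
instances = [
    {"domain": "person", "text": "Ada Lovelace was born in London.",
     "triples": [("Ada Lovelace", "place of birth", "London")]},
    {"domain": "person", "text": "Alan Turing worked at Bletchley Park.",
     "triples": [("Alan Turing", "employer", "Bletchley Park"),
                 ("Alan Turing", "work location", "Bletchley Park")]},
]

def per_domain_averages(instances):
    """Average triple count and token count per instance, grouped by domain."""
    stats = defaultdict(lambda: [0, 0, 0])  # domain -> [n, triples, tokens]
    for ex in instances:
        s = stats[ex["domain"]]
        s[0] += 1
        s[1] += len(ex["triples"])
        s[2] += len(ex["text"].split())  # whitespace tokenization (assumed)
    return {d: (tr / n, tok / n) for d, (n, tr, tok) in stats.items()}

print(per_domain_averages(instances))  # avg triples and avg tokens per domain
```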
Quotations
"Large language models can perform well on general natural language tasks, but their effectiveness is still not optimal for information extraction."
"To address this issue, we introduce INSTRUCTIE, a bilingual instruction-based information extraction dataset, which covers 12 diverse domains."
"Experimental results demonstrate that large language models trained with INSTRUCTIE can not only obtain better information extraction capabilities but also enhance zero-shot performance compared with baselines."