Core Concepts
Large language models can be effectively applied to biomedical triple extraction, but their performance varies significantly across different datasets. A high-quality biomedical triple extraction dataset with comprehensive relation type coverage is crucial for developing robust triple extraction systems.
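To make the task concrete, a biomedical triple pairs two entities with a relation type. The sketch below shows one way to represent such a triple; the sentence, entity names, and relation label are illustrative (in the style of a drug-drug interaction), not taken from any of the datasets discussed.

```python
from typing import NamedTuple

class Triple(NamedTuple):
    """A (head entity, relation, tail entity) triple extracted from text."""
    head: str
    relation: str
    tail: str

# Hypothetical example sentence and labels, for illustration only.
sentence = "Aspirin may increase the anticoagulant effect of warfarin."
triple = Triple(head="Aspirin", relation="INTERACTS_WITH", tail="warfarin")
print(triple)  # → Triple(head='Aspirin', relation='INTERACTS_WITH', tail='warfarin')
```

A triple extraction system must identify both entity spans and assign one of the dataset's relation types, which is why the number and coverage of relation types matters for benchmarking.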
Abstract
The paper focuses on benchmarking the performance of various large language models (LLMs) on biomedical triple extraction tasks. It highlights two key challenges in this domain:
- The application of LLMs to triple extraction remains relatively unexplored.
- The lack of high-quality biomedical triple extraction datasets with comprehensive coverage of relation types impedes progress in developing robust triple extraction systems.
To address these challenges, the authors:
- Conduct a thorough analysis of several LLMs' performance on three biomedical triple extraction datasets: DDI, ChemProt, and the newly introduced GIT (General BioMedical and Complementary and Integrative Health Triples) dataset.
- Introduce the GIT dataset, which is characterized by high-quality annotations and a comprehensive coverage of 22 distinct relation types, surpassing the size and diversity of existing datasets.
The key findings include:
- GPT-3.5 and GPT-4 exhibit the lowest performance, likely because they are evaluated in a zero-shot setting.
- Despite being trained on biomedical-domain data, MedLLaMA 13B still performs worse than LLaMA2 13B.
- The GIT dataset provides a valuable benchmark for biomedical triple extraction, with its extensive relation type coverage and expert annotations.
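The zero-shot finding above can be illustrated with a minimal prompt sketch. This is a hypothetical template, not the prompt used in the paper, and the relation names are an illustrative subset: in a zero-shot setting the model receives only an instruction and the target sentence, with no labeled examples, which plausibly explains the gap to fine-tuned models.

```python
# Illustrative subset of relation labels; GIT itself defines 22 relation types.
RELATION_TYPES = ["INTERACTS_WITH", "TREATS", "CAUSES"]

def build_zero_shot_prompt(sentence: str) -> str:
    """Assemble a zero-shot instruction: no in-context examples are given."""
    return (
        "Extract all (head, relation, tail) triples from the sentence below. "
        f"Use only these relation types: {', '.join(RELATION_TYPES)}.\n"
        f"Sentence: {sentence}\n"
        "Triples:"
    )

prompt = build_zero_shot_prompt("Aspirin may increase the effect of warfarin.")
print(prompt)
```

A few-shot variant would prepend labeled examples to the same template, which typically narrows the gap for instruction-following models.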
Stats
The GIT dataset contains 4,691 labeled sentences, which is larger than all other commonly used biomedical triple extraction datasets.
The GIT dataset covers 22 distinct relation types, providing more comprehensive coverage compared to other datasets.
Quotes
"GIT differs from other triple extraction datasets because it includes a broader array of relation types, encompassing 22 distinct types."
"GIT contains 3,734 training instances, 465 testing instances, and 492 validation instances. In GIT, the training, testing, and validation datasets each consist of distinct instances, ensuring there are no duplicates or overlaps between them."