Core Concepts
Incorporating relevant concept definitions on the fly can significantly improve the performance of large language models on biomedical named entity recognition, especially in limited-data settings.
Abstract
This paper presents a comprehensive exploration of prompting strategies for applying large language models (LLMs) to biomedical named entity recognition (NER). The authors first establish baseline performance for several state-of-the-art LLMs, including GPT-3.5, GPT-4, Claude 2, and Llama 2, in both zero-shot and few-shot settings across six diverse biomedical NER datasets.
The key contribution is a new knowledge augmentation approach that incorporates definitions of relevant biomedical concepts dynamically during inference. The authors explore two prompting strategies for this (a code sketch follows the list):
Single-turn prompting: a single follow-up prompt asks the model to correct all extracted entities at once, based on the provided definitions.
Iterative prompting: a sequence of prompts, each asking the model to correct a single extracted entity based on its definition.
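To make the two strategies concrete, here is a minimal Python sketch. The `call_llm` helper and the prompt wording are hypothetical placeholders for illustration, not the paper's exact prompts or API.

```python
# Minimal sketch of the two definition-augmented correction strategies.
# `call_llm` is a hypothetical placeholder for any chat-completion API;
# the prompt wording below is illustrative, not the paper's exact prompts.

def call_llm(prompt: str) -> str:
    """Send a prompt to an LLM and return its text response."""
    raise NotImplementedError  # wire up your provider of choice here

def single_turn_correction(sentence: str, entities: list[str],
                           definitions: dict[str, str]) -> str:
    """One follow-up prompt asks for corrections to ALL entities at once."""
    defs_block = "\n".join(
        f"- {e}: {definitions.get(e, 'no definition found')}" for e in entities
    )
    prompt = (
        f"Sentence: {sentence}\n"
        f"Extracted entities: {', '.join(entities)}\n"
        f"Concept definitions:\n{defs_block}\n"
        "Using these definitions, return a corrected list of entities."
    )
    return call_llm(prompt)

def iterative_correction(sentence: str, entities: list[str],
                         definitions: dict[str, str]) -> list[str]:
    """A sequence of prompts, each reconsidering a single entity."""
    corrected = []
    for entity in entities:
        prompt = (
            f"Sentence: {sentence}\n"
            f"Candidate entity: {entity}\n"
            f"Definition: {definitions.get(entity, 'no definition found')}\n"
            "Reply with the corrected entity, or 'none' to drop it."
        )
        answer = call_llm(prompt).strip()
        if answer.lower() != "none":
            corrected.append(answer)
    return corrected
```

The practical trade-off between the two is call volume: the single-turn variant costs one extra LLM call per sentence, while the iterative variant spends one call per extracted entity in exchange for more focused corrections.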
The results show that definition augmentation yields consistent, significant improvements across the LLMs, including an average relative improvement of 15% in GPT-4 F1 across the six test datasets. Ablation studies confirm that these gains stem from the relevance of the provided definitions rather than merely from additional context.
The authors also explore using definitions from different sources (UMLS, Wikidata, LLM-generated) and find that human-curated definitions from UMLS lead to the highest performance improvements. Overall, this work demonstrates the value of dynamically incorporating relevant knowledge to enhance LLM performance on specialized tasks like biomedical NER.
Example
An input sentence and the UMLS-style definition retrieved for its candidate entity:
There are several common polymorphisms in the BRCA1 gene which generate amino acid substitutions.
BRCA1 gene: A tumor suppressor gene (GENES, TUMOR SUPPRESSOR) that is a component of DNA repair pathways.
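Definitions like the one above can be retrieved programmatically. Below is a rough sketch against the UMLS Terminology Services (UTS) REST API; the endpoint paths and JSON fields follow the public UTS documentation as I recall it and should be verified before use, and `api_key` is your own UTS credential.

```python
# Rough sketch: look up a human-curated definition via the UMLS UTS REST
# API. Endpoint paths and response fields are assumptions based on the
# public UTS docs; verify against the current documentation.
import requests

UTS_BASE = "https://uts-ws.nlm.nih.gov/rest"

def umls_definition(term: str, api_key: str) -> str | None:
    # Step 1: search for the term to resolve it to a concept (CUI).
    search = requests.get(
        f"{UTS_BASE}/search/current",
        params={"string": term, "apiKey": api_key},
        timeout=30,
    ).json()
    results = search.get("result", {}).get("results", [])
    if not results:
        return None
    cui = results[0]["ui"]

    # Step 2: fetch curated definitions attached to that CUI.
    defs = requests.get(
        f"{UTS_BASE}/content/current/CUI/{cui}/definitions",
        params={"apiKey": api_key},
        timeout=30,
    ).json()
    entries = defs.get("result", [])
    return entries[0]["value"] if entries else None

# e.g., umls_definition("BRCA1 gene", api_key) should return a definition
# similar to the tumor-suppressor description shown above.
```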
Quotes
"Despite their general capabilities, LLMs still struggle on biomedical NER tasks, which are difficult due to the presence of specialized terminology and lack of training data."
"Our experiments show that definition augmentation is useful for both open source and closed LLMs. For example, it leads to a relative improvement of 15% (on average) in GPT-4 performance (F1) across all (six) of our test datasets."