Core Concepts
A comparison of masked language models and generative large language models on a natural language inference task over clinical trial data, evaluated on F1 score, faithfulness, and consistency.
Abstract
The paper describes two approaches to address the SemEval 2024 Task 2 on Safe Biomedical Natural Language Inference for Clinical Trials (NLI4CT):
Fine-tuning and ensembling Masked Language Models (MLMs) such as NLI-RoBERTa, ClinicalBERT, and Clinical-Longformer. The ensemble of these MLMs achieved an F1 score of 0.57, Faithfulness of 0.64, and Consistency of 0.56.
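The summary does not spell out how the MLM predictions are combined; a common aggregation rule for such ensembles is majority voting over per-model labels, sketched below (the label strings and example outputs are illustrative, not taken from the paper):

```python
from collections import Counter

def majority_vote(predictions):
    """Combine one instance's per-model labels by majority vote.

    `predictions` holds one label ("Entailment" or "Contradiction")
    per model in the ensemble.
    """
    return Counter(predictions).most_common(1)[0][0]

# Hypothetical outputs of three fine-tuned MLMs for one statement:
labels = ["Entailment", "Contradiction", "Entailment"]
print(majority_vote(labels))  # → Entailment
```

With an odd number of models, as here, a two-way vote can never tie, which is one practical reason to ensemble three rather than two MLMs.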
Prompting Large Language Models (LLMs) such as Flan-T5-large with Chain-of-Thought and Contrastive Chain-of-Thought techniques in zero-shot, 1-shot, and 2-shot settings. The best LLM system, using 2-shot prompting, matched the performance of the MLM ensemble.
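A few-shot Chain-of-Thought prompt for this task might be assembled as below. The example premises, hypotheses, and the "Let's think step by step" cue are illustrative assumptions, not the paper's actual prompts:

```python
# Two hypothetical in-context examples for the NLI task (2-shot setting).
EXAMPLES = [
    ("The trial enrolled 120 adult patients.",
     "Adults participated in the trial.",
     "Entailment"),
    ("All participants received the placebo.",
     "Some participants received the active drug.",
     "Contradiction"),
]

def build_prompt(premise, hypothesis):
    """Build a 2-shot Chain-of-Thought prompt for an NLI instance."""
    parts = []
    for p, h, label in EXAMPLES:
        parts.append(f"Premise: {p}\nHypothesis: {h}\n"
                     f"Let's think step by step. Answer: {label}")
    # The query instance is appended last, with the answer left open.
    parts.append(f"Premise: {premise}\nHypothesis: {hypothesis}\n"
                 "Let's think step by step. Answer:")
    return "\n\n".join(parts)
```

The resulting string would then be passed to a text-to-text model such as Flan-T5-large, which generates the label after the final "Answer:".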
The authors analyze the results in depth, breaking down performance by gold label, by whether an instance compares clinical trial reports, by inference type, and by clinical trial report section. They find that the MLM approach is more computationally efficient than the LLM approach while achieving similar performance.
The authors also discuss potential future work, including continued pretraining of MLMs on clinical trial data and incorporating domain ontologies to improve the models' performance.
Stats
The NLI4CT dataset consists of 999 clinical trial reports and 2,400 statements.
The average length of a statement is 19.5 tokens, and the average length of an evidence text is 10.7 tokens.
The maximum length of a statement is 65 tokens, and the maximum length of an evidence text is 197 tokens.
The dataset is balanced, with 50% of instances labeled as Entailment and 50% as Contradiction.
Quotes
"Prompting Flan-T5-large in a 2-shot setting leads to our best system that achieves 0.57 F1 score, 0.64 Faithfulness, and 0.56 Consistency."
"The ensemble of 3 NLI-RoBERTa does not add enough diversity to improve its results."
"The single ClinicalBERT obtains an F1-score of 0.00: we observed that it always predicts the label Contradiction, which causes a precision and recall of 0.00."
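The ClinicalBERT quote can be checked arithmetically: if a model never predicts Entailment, it has zero true positives for that class, so precision, recall, and F1 are all 0. A minimal sketch, assuming Entailment is the positive class and using made-up balanced labels:

```python
def entailment_scores(gold, pred):
    """Precision, recall, and F1 for the Entailment class."""
    tp = sum(g == p == "Entailment" for g, p in zip(gold, pred))
    fp = sum(g != "Entailment" and p == "Entailment" for g, p in zip(gold, pred))
    fn = sum(g == "Entailment" and p != "Entailment" for g, p in zip(gold, pred))
    precision = tp / (tp + fp) if tp + fp else 0.0
    recall = tp / (tp + fn) if tp + fn else 0.0
    f1 = 2 * precision * recall / (precision + recall) if precision + recall else 0.0
    return precision, recall, f1

gold = ["Entailment", "Contradiction"] * 5  # balanced, as in NLI4CT
pred = ["Contradiction"] * 10               # always predicts Contradiction
print(entailment_scores(gold, pred))  # → (0.0, 0.0, 0.0)
```

With zero true positives, both numerators vanish, so no choice of denominator convention rescues the score: F1 is 0.00 regardless of how many instances are labeled Contradiction correctly.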