The paper explores the capabilities of large language models (LLMs) such as Gemini Pro, GPT-3.5, and Flan-T5 in performing safe biomedical natural language inference (NLI) on clinical trial reports (CTRs) for breast cancer. The task, part of SemEval 2024 Task 2, involves determining the inference relation (entailment or contradiction) between CTR-statement pairs.
The key highlights and insights are:
The authors experiment with various pre-trained language models (PLMs) and LLMs, including BioLinkBERT, SciBERT, ClinicalBERT, and ClinicalTrialBioBERT-NLI4CT, in addition to Gemini Pro and GPT-3.5.
They integrate the Tree of Thoughts (ToT) and Chain-of-Thought (CoT) reasoning frameworks into the Gemini Pro and GPT-3.5 models to improve their reasoning capabilities.
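The authors release their own instruction templates; purely as an illustration of what a Chain-of-Thought prompt for this task can look like, here is a minimal sketch (the template wording and the `build_cot_prompt` helper are assumptions, not the authors' released templates):

```python
def build_cot_prompt(ctr_premise: str, statement: str) -> str:
    """Assemble a Chain-of-Thought prompt for CTR-statement inference.

    Illustrative template only (not the paper's released one): it asks
    the model to reason step by step before committing to a single
    label, Entailment or Contradiction.
    """
    return (
        "You are given an excerpt from a clinical trial report and a statement.\n"
        "Decide whether the statement is entailed by or contradicts the report.\n\n"
        f"Clinical trial report:\n{ctr_premise}\n\n"
        f"Statement:\n{statement}\n\n"
        "Let's think step by step, then answer with exactly one word: "
        "Entailment or Contradiction."
    )


prompt = build_cot_prompt(
    "The primary endpoint was progression-free survival at 12 months.",
    "The trial measured overall survival as its primary endpoint.",
)
```

The resulting string would then be sent to the underlying model (Gemini Pro or GPT-3.5); the "think step by step" instruction is the standard CoT trigger, while ToT additionally branches and scores multiple reasoning paths.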
Gemini Pro emerges as the top-performing model, achieving an F1 score of 0.69, a consistency score of 0.71, and a faithfulness score of 0.90 on the official test dataset.
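The consistency and faithfulness scores are computed by the official NLI4CT scorer over perturbed statements. Under the common reading of these metrics (a semantics-preserving paraphrase should leave the prediction unchanged, while a semantics-altering edit should flip it), they can be sketched as follows; the function names and exact definitions here are assumptions, not the official evaluation code:

```python
def consistency(orig_preds: list, paraphrase_preds: list) -> float:
    """Fraction of semantics-preserving perturbations whose prediction
    matches the prediction on the original statement (higher is better)."""
    assert len(orig_preds) == len(paraphrase_preds)
    same = sum(o == p for o, p in zip(orig_preds, paraphrase_preds))
    return same / len(orig_preds)


def faithfulness(orig_preds: list, altered_preds: list) -> float:
    """Fraction of semantics-altering perturbations whose prediction
    flips relative to the original statement (higher is better)."""
    assert len(orig_preds) == len(altered_preds)
    flipped = sum(o != a for o, a in zip(orig_preds, altered_preds))
    return flipped / len(orig_preds)
```

On this reading, Gemini Pro's 0.90 faithfulness means it almost always changes its prediction when the statement's meaning is changed, while its 0.71 consistency means it is less stable under paraphrase.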
The authors conduct a comparative analysis of Gemini Pro and GPT-3.5, finding that GPT-3.5 underperforms Gemini Pro on numerical reasoning tasks in particular.
The paper emphasizes the importance of prompt engineering for LLMs to enhance their performance on the NLI4CT task.
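The released templates are the authoritative reference for the prompts actually used; as a generic illustration of one prompt-engineering lever, constraining the label set and output format makes the model's answer trivially parseable (the template wording below is an assumption, not the paper's):

```python
# Hypothetical shared instruction template, not the authors' released one.
INSTRUCTION_TEMPLATE = (
    "Task: biomedical natural language inference over clinical trial reports.\n"
    "Label set: Entailment, Contradiction.\n"
    "Respond with the label only, no explanation.\n\n"
    "Report:\n{premise}\n\nStatement:\n{statement}\n\nLabel:"
)


def format_prompt(premise: str, statement: str) -> str:
    # Fill the shared template; every example then ends in "Label:",
    # so the model's next token is the answer itself.
    return INSTRUCTION_TEMPLATE.format(premise=premise, statement=statement)
```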
The authors make their instruction templates and code publicly available to facilitate reproducibility.
by Shreyasi Man... at arxiv.org, 04-09-2024
https://arxiv.org/pdf/2404.04510.pdf