Evaluating Large Language Models' Capabilities for Safe Biomedical Natural Language Inference on Clinical Trial Reports
Large language models (LLMs) can perform strongly on natural language inference tasks in the biomedical domain, but they still struggle to remain consistent and faithful and to reason robustly, particularly when clinical trial reports call for numerical or logical inference.
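To make the terms "consistency" and "faithfulness" concrete, the following is a minimal sketch of how such scores might be computed from paired predictions. It rests on assumptions not stated in the text above: each original statement is paired with a perturbed version, a perturbation either preserves or alters the gold label, consistency is the share of label-preserving perturbations on which the prediction stays the same, and faithfulness is the share of label-altering perturbations on which the prediction changes. The names `Pair`, `consistency`, and `faithfulness` are hypothetical, and these definitions are not necessarily the ones used in the evaluation described here.

```python
from dataclasses import dataclass
from typing import List, Optional


@dataclass
class Pair:
    """One original/perturbed statement pair (hypothetical structure)."""
    original_pred: str      # model label for the original statement
    perturbed_pred: str     # model label for the perturbed statement
    label_preserving: bool  # True if the perturbation keeps the gold label


def consistency(pairs: List[Pair]) -> Optional[float]:
    """Share of label-preserving perturbations with an unchanged prediction."""
    kept = [p for p in pairs if p.label_preserving]
    if not kept:
        return None  # undefined when there are no label-preserving pairs
    return sum(p.original_pred == p.perturbed_pred for p in kept) / len(kept)


def faithfulness(pairs: List[Pair]) -> Optional[float]:
    """Share of label-altering perturbations where the prediction changes."""
    altered = [p for p in pairs if not p.label_preserving]
    if not altered:
        return None  # undefined when there are no label-altering pairs
    return sum(p.original_pred != p.perturbed_pred for p in altered) / len(altered)


# Illustrative, made-up predictions (not results from any model).
example_pairs = [
    Pair("Entailment", "Entailment", label_preserving=True),
    Pair("Entailment", "Contradiction", label_preserving=True),
    Pair("Entailment", "Contradiction", label_preserving=False),
    Pair("Contradiction", "Entailment", label_preserving=False),
]

consistency_score = consistency(example_pairs)
faithfulness_score = faithfulness(example_pairs)
```

Under these assumed definitions, a model can score well on plain accuracy and still score poorly here, for example by changing its answer when a statement is merely paraphrased.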