Accurately Extracting Numerical Findings from Randomized Controlled Trials Using Large Language Models

Core Concepts
Modern large language models can reliably extract the numerical data necessary to conduct meta-analyses of randomized controlled trials, though performance varies depending on the complexity of the outcome measures.
The key insights from this work are:

- The authors annotated a dataset of 699 records from 120 randomized controlled trial (RCT) reports, with detailed annotations of the numerical findings associated with specific interventions, comparators, and outcomes (ICO triplets). This dataset is released to support future work in this area.
- The authors evaluated a diverse set of large language models (LLMs), including both massive, closed models and smaller, open-source models, on extracting the numerical data necessary to conduct meta-analyses in a zero-shot setting.
- For binary (dichotomous) outcomes, massive LLMs like GPT-4 performed well, achieving exact match accuracies over 65%. For continuous outcomes, however, even the best-performing LLMs struggled, with GPT-4 achieving only 48.7% exact match accuracy.
- Error analysis revealed that LLMs sometimes misinfer the type of outcome (binary vs. continuous), extract values from the wrong intervention/comparator groups or time points, and have difficulty performing simple mathematical operations such as division to infer total group sizes.
- Despite these limitations, the authors demonstrate that modern LLMs can support largely automated meta-analyses by first extracting the raw numerical data and then using specialized statistical software to compute the necessary summary statistics. This represents a promising step toward fully automated evidence synthesis.
Remdesivir reduced all-cause mortality at up to day 28 compared to standard care, with an odds ratio of 0.92 (95% CI: 0.79, 1.07). The total number of participants in the remdesivir group was 3,635, with 369 deaths. The total number of participants in the standard care group was 3,507, with 380 deaths.
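The extract-then-compute workflow described above can be sketched from these counts: given the extracted 2x2 table, code (rather than the LLM) computes the unadjusted odds ratio and a Wald confidence interval. Note the article's 0.92 (0.79, 1.07) figure comes from the published pooled analysis, so the raw computation below lands slightly higher.

```python
import math

def odds_ratio_ci(events_a, total_a, events_b, total_b, z=1.96):
    """Unadjusted odds ratio with a Wald 95% CI from a 2x2 table."""
    a, b = events_a, total_a - events_a          # treatment: events / non-events
    c, d = events_b, total_b - events_b          # control:   events / non-events
    or_ = (a / b) / (c / d)
    se_log_or = math.sqrt(1/a + 1/b + 1/c + 1/d)  # standard error of log(OR)
    lo = math.exp(math.log(or_) - z * se_log_or)
    hi = math.exp(math.log(or_) + z * se_log_or)
    return or_, lo, hi

# Counts extracted from the remdesivir example above
or_, lo, hi = odds_ratio_ci(369, 3635, 380, 3507)
# or_ ≈ 0.93 with CI ≈ (0.80, 1.08) — close to the reported pooled estimate
```

Delegating this arithmetic to deterministic code sidesteps exactly the class of computation errors (e.g., division mistakes) the error analysis attributes to the LLMs.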
"Estimates from meta-analyses of primary findings are considered one of the highest forms of evidence in medicine."

"Rigorous evidence syntheses are time-consuming and labor-intensive, requiring manual extraction of data from individual trials to be synthesized."

Deeper Inquiries

How can we leverage additional context from the full-text of RCT reports to improve the performance of LLMs on numerical data extraction tasks?

To enhance the performance of LLMs on numerical data extraction from full-text RCT reports, leveraging additional context from the full text can be highly beneficial. Here are some strategies to achieve this:

- **Contextual embeddings:** Providing LLMs with the full context of the RCT report, including the introduction, methods, results, and discussion sections, lets the models capture relationships between different parts of the text. This helps in accurately extracting numerical data that may be referenced across sections.
- **Section-specific prompts:** Tailoring prompts to the relevant section of the RCT report can focus the LLM on the right information. For example, prompts targeting the results section are most useful when extracting numerical data for outcomes.
- **Entity linking:** Entity linking techniques can help LLMs identify specific entities (interventions, outcomes, comparators) mentioned in the text and link them to their associated numerical data, providing a more structured approach to extraction.
- **Mathematical context:** Including mathematical context within the prompts can help LLMs interpret the numerical data within a statistical framework, supporting accurate extraction for meta-analysis.
- **Fine-tuning on full text:** Training LLMs on full-text RCT reports can teach them the nuances and patterns specific to this domain. Fine-tuning on a diverse set of full-text reports improves the model's grasp of medical literature.

By combining these strategies, LLMs can exploit the additional context in full-text RCT reports, ultimately improving the accuracy and reliability of automated meta-analyses.
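The section-specific prompting idea can be sketched concretely. Below is a hypothetical zero-shot prompt builder for one ICO triplet that targets the results section and requests structured JSON output; the template and field names are illustrative, not the paper's actual prompts.

```python
def build_ico_prompt(intervention, comparator, outcome, results_text):
    """Assemble a hypothetical zero-shot prompt asking for raw counts as JSON."""
    return (
        "From the trial results below, extract the numerical findings for:\n"
        f"  intervention: {intervention}\n"
        f"  comparator:   {comparator}\n"
        f"  outcome:      {outcome}\n"
        'Reply with JSON: {"events_intervention": int, "total_intervention": int, '
        '"events_comparator": int, "total_comparator": int}\n\n'
        f"Results section:\n{results_text}"
    )

prompt = build_ico_prompt(
    "remdesivir", "standard care", "all-cause mortality at day 28",
    "369 of 3,635 patients on remdesivir died, vs. 380 of 3,507 on standard care.",
)
```

Scoping the prompt to the results section keeps the context window small while anchoring the model to the ICO triplet of interest.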

How might we incorporate mathematical reasoning capabilities into LLMs to improve their ability to perform the necessary computations for meta-analysis?

Incorporating mathematical reasoning capabilities into LLMs can significantly enhance their ability to perform the computations needed for meta-analysis. Here are some approaches to achieve this:

- **Mathematical operations module:** Integrate a specialized module within the LLM architecture that performs basic operations such as addition, subtraction, multiplication, and division. This module can assist in calculating point estimates, variances, and other statistical quantities required for meta-analysis.
- **Mathematical prompting:** Provide explicit mathematical prompts to guide the LLM's calculations. Structuring the input data in a mathematically interpretable format helps the model understand and execute the required computations accurately.
- **Mathematical reasoning training:** Train LLMs on a diverse set of mathematical reasoning tasks specific to meta-analysis calculations. This can deepen the models' understanding of statistical concepts and improve their ability to perform complex computations.
- **External mathematical libraries:** Integrate external mathematical libraries or tools into the LLM framework to handle complex operations. By offloading calculations to existing statistical software, the system can produce results with precision and efficiency.
- **Feedback mechanism:** Evaluate the model's mathematical reasoning during training and feed back the accuracy of its computations, so the model can learn and improve over time.

Together, these strategies can give LLM-based systems the robust mathematical reasoning needed to perform meta-analysis computations accurately and reliably.
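The external-libraries point is illustrated below with a minimal fixed-effect, inverse-variance pooling of log odds ratios, exactly the kind of computation better delegated to deterministic code than to the LLM. This is a sketch using only the standard library; the trial counts are invented for illustration.

```python
import math

def pooled_log_or(tables):
    """Fixed-effect inverse-variance pooling of log odds ratios from 2x2 tables."""
    num = den = 0.0
    for a, b, c, d in tables:  # events/non-events, treatment then control
        log_or = math.log((a / b) / (c / d))
        var = 1/a + 1/b + 1/c + 1/d   # variance of log(OR)
        num += log_or / var           # weight each study by 1 / variance
        den += 1 / var
    return num / den

# Invented counts for three hypothetical trials
tables = [(10, 90, 15, 85), (8, 92, 12, 88), (20, 180, 25, 175)]
pooled_or = math.exp(pooled_log_or(tables))
```

In an LLM-driven pipeline, the model would supply only the raw per-trial counts; the pooling itself stays in statistical code, mirroring the extract-then-compute design the paper advocates.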

What other types of structured data extraction tasks in healthcare could benefit from the capabilities of modern LLMs?

Modern LLMs have shown great potential in structured data extraction tasks in healthcare beyond meta-analysis. Tasks that could benefit from their capabilities include:

- **Clinical coding:** LLMs can support automated clinical coding by extracting relevant information from patient records, diagnoses, procedures, and treatments, streamlining the coding process and improving accuracy in healthcare documentation.
- **Drug adverse event detection:** LLMs can extract and analyze adverse events related to specific drugs from medical literature, patient records, and pharmacovigilance databases, helping detect potential drug safety issues early.
- **Patient phenotyping:** LLMs can extract and categorize patient characteristics, medical history, and treatment outcomes from electronic health records, supporting personalized medicine and clinical decision-making.
- **Clinical trial matching:** LLMs can match patients to appropriate clinical trials by extracting eligibility criteria from trial protocols and patient data from medical records, facilitating recruitment and improving trial efficiency.
- **Healthcare quality assessment:** LLMs can extract and analyze quality indicators from healthcare data to assess the care provided by healthcare facilities, identifying areas for improvement and enhancing patient outcomes.
- **Medical image analysis:** LLMs can support structured data extraction from medical imaging reports, such as identifying and categorizing abnormalities, lesions, or anatomical structures, assisting radiologists in diagnostic decision-making.

Automating these structured extraction tasks with modern LLMs promises improved efficiency, accuracy, and insight for healthcare professionals and researchers.
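A pattern common to all of these tasks is validating the LLM's free-text output against a fixed schema before downstream use. A minimal sketch follows, using the ICO-style binary outcome as the target schema; the field names and checks are illustrative, not drawn from any specific system.

```python
import json
from dataclasses import dataclass

@dataclass
class BinaryOutcome:
    """Illustrative schema for one extracted ICO finding with binary counts."""
    intervention: str
    comparator: str
    outcome: str
    events_intervention: int
    total_intervention: int
    events_comparator: int
    total_comparator: int

    def validate(self):
        # Counts must be non-negative and events cannot exceed group totals
        assert 0 <= self.events_intervention <= self.total_intervention
        assert 0 <= self.events_comparator <= self.total_comparator

# Hypothetical JSON as an LLM might return it for the remdesivir example
raw = ('{"intervention": "remdesivir", "comparator": "standard care", '
       '"outcome": "all-cause mortality", "events_intervention": 369, '
       '"total_intervention": 3635, "events_comparator": 380, '
       '"total_comparator": 3507}')
finding = BinaryOutcome(**json.loads(raw))
finding.validate()
```

Simple schema checks like these catch a useful share of extraction errors (e.g., events exceeding group totals) before the numbers reach statistical software.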