Key Concepts
Integrating large language models such as GPT-4 with structured knowledge bases, such as CEDAR templates, can significantly improve how closely the metadata of biomedical datasets adheres to community standards.
Summary
The paper investigates whether large language models (LLMs), specifically GPT-4, can improve adherence to metadata standards. The authors ran experiments on 200 randomly selected records from the NCBI BioSample repository, each describing a human sample related to lung cancer, and evaluated GPT-4's ability to suggest edits that bring the records in line with the standard.
The key findings are:
- Used alone, GPT-4 achieved only a marginal improvement in average adherence to the standard data dictionary, from 79% to 80%.
- When prompted with domain information in the form of textual descriptions of CEDAR templates, GPT-4 raised average adherence from 79% to 97%, a statistically significant improvement (p<0.01).
- These results indicate that while LLMs may not be able to correct legacy metadata to ensure satisfactory adherence to standards when unaided, they do show promise for use in automated metadata curation when integrated with a structured knowledge base.
- The authors provide a route by which the biomedical community can programmatically make its vast archive of online datasets more FAIR, enabling secondary analyses of data and the possibility of making new discoveries.
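To make the second finding concrete, the sketch below shows one way template information could be placed in a prompt alongside a legacy record. It is only an illustration: the function `build_prompt`, the field descriptions, and the prompt wording are all hypothetical and are not taken from the paper, and no call to GPT-4 or to CEDAR is made.

```python
from typing import Dict

# Hypothetical textual descriptions of template fields (not the real CEDAR text).
FIELD_DESCRIPTIONS: Dict[str, str] = {
    "tissue": "Anatomical tissue the sample was taken from; use an ontology term.",
    "disease": "Disease diagnosed in the donor; use an ontology term.",
    "cell type": "Type of cell in the sample; use an ontology term.",
}


def build_prompt(record: Dict[str, str], field_descriptions: Dict[str, str]) -> str:
    """Combine template field descriptions and a legacy record into one prompt."""
    lines = ["Correct the metadata record below so it follows the template.", ""]
    lines.append("Permitted fields:")
    for name, description in field_descriptions.items():
        lines.append(f"- {name}: {description}")
    lines.append("")
    lines.append("Record:")
    for name, value in record.items():
        lines.append(f"{name}: {value}")
    lines.append("")
    lines.append("Return the corrected record using only the permitted fields.")
    return "\n".join(lines)


# Illustrative legacy record with a non-standard field name.
example_record = {"tissue_type": "Lung", "disease": "lung cancer"}
prompt = build_prompt(example_record, FIELD_DESCRIPTIONS)
```

The resulting string would then be sent to the model; in the unaided setting described in the first finding, the "Permitted fields" section would simply be absent.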
Statistics
The average adherence accuracy for three fields ('tissue', 'disease' and 'cell type') improved from 40% in the original BioSample records to 77% when using GPT-4 with the CEDAR templates.
The average adherence accuracy of the samples improved from 79% in the original BioSample records to 97% when using GPT-4 with the CEDAR templates.
The average error count per sample decreased from 1.64 in the original BioSample records to 0.85 when using GPT-4 with the CEDAR templates.
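The summary does not spell out how adherence and error counts are scored, so the following is a minimal sketch under a stated assumption: a field counts as an error if its name is not in the template or its value is outside the allowed set, and adherence is the fraction of fields without errors. The template contents and the names `TEMPLATE`, `count_errors`, and `adherence` are hypothetical.

```python
from typing import Dict, Optional, Set

# Hypothetical template: field name -> allowed values (None means free text).
TEMPLATE: Dict[str, Optional[Set[str]]] = {
    "tissue": {"lung", "blood"},
    "disease": {"lung cancer", "non-small cell lung carcinoma"},
    "cell type": {"epithelial cell", "fibroblast"},
    "sample name": None,
}


def count_errors(record: Dict[str, str], template: Dict[str, Optional[Set[str]]]) -> int:
    """Count fields with an unknown name or a value outside the allowed set."""
    errors = 0
    for field, value in record.items():
        if field not in template:
            errors += 1  # field name not permitted by the template
            continue
        allowed = template[field]
        if allowed is not None and value.strip().lower() not in allowed:
            errors += 1  # value not among the permitted values
    return errors


def adherence(record: Dict[str, str], template: Dict[str, Optional[Set[str]]]) -> float:
    """Fraction of fields in the record that adhere to the template (assumed metric)."""
    if not record:
        return 1.0
    return 1 - count_errors(record, template) / len(record)


# Illustrative record: "cell_type" is a non-standard field name.
record = {
    "tissue": "Lung",
    "disease": "lung cancer",
    "cell_type": "epithelial cell",
    "sample name": "S1",
}
errors = count_errors(record, TEMPLATE)   # 1
score = adherence(record, TEMPLATE)       # 0.75
```

Averaging `score` and `errors` over all records would give figures comparable in kind to the per-sample adherence and error counts reported above, although the paper's exact scoring rules may differ.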
Quotes
"Our findings indicate that relying solely on GPT-4 may not adequately ensure adherence to these standards. However, by integrating the BioSample metadata template from CEDAR, which includes a comprehensive list of permissible field names and allowed value ranges, we can effectively harness the potential of LLMs."
"The gains were especially visible for the field name "cell type" as the field name can be quite ambiguous without the domain information provided about ontological restrictions."
"The average correctness increases from 79 percent to 97 percent (p<0.01) and the average error reduces from 1.64 per record to 0.85 per record (p<0.01). This result is especially interesting, as the LLM+CEDAR version, on average, has more field names than the original BioSample record."