The article introduces a methodology that combines text-to-SQL generation with retrieval-augmented generation (RAG) to answer epidemiological questions using electronic health records (EHR) and claims data. Integrating medical coding into the process significantly improves performance over simple prompting. The study shows that current language models are not yet accurate enough for unsupervised use, but that RAG is a promising direction for improving their capabilities in an industry setting. The manually curated dataset provides a realistic and highly complex selection of epidemiological questions drawn from industry practice. Using the OMOP-CDM data model helps address the variability in data retrieval across databases with differing data models. The methodology uses large language models and RAG to translate natural language questions into SQL queries that retrieve the requested information from the database.
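To make the retrieval step concrete, here is a minimal sketch of how a RAG prompt for text-to-SQL could be assembled. It is not the authors' implementation: the example question/SQL pairs, the function names (`retrieve`, `build_prompt`), the token-overlap (Jaccard) similarity, and the `:..._concept_id` placeholders are all assumptions made for illustration. A real system would more likely use embedding-based retrieval and resolve medical codes to concept IDs before running the query. Only the table and column names follow the OMOP-CDM.

```python
import re

# Hypothetical curated (question, SQL) pairs; NOT taken from the paper's dataset.
# The :..._concept_id placeholders stand in for codes resolved by a medical coding step.
EXAMPLES = [
    ("How many patients have type 2 diabetes?",
     "SELECT COUNT(DISTINCT person_id) FROM condition_occurrence "
     "WHERE condition_concept_id = :t2dm_concept_id;"),
    ("How many patients were exposed to metformin?",
     "SELECT COUNT(DISTINCT person_id) FROM drug_exposure "
     "WHERE drug_concept_id = :metformin_concept_id;"),
    ("What is the number of patients born before 1960?",
     "SELECT COUNT(*) FROM person WHERE year_of_birth < 1960;"),
]


def tokens(text):
    """Lowercase a string and return its set of alphanumeric tokens."""
    return set(re.findall(r"[a-z0-9]+", text.lower()))


def jaccard(a, b):
    """Token-overlap similarity; an assumed stand-in for embedding similarity."""
    ta, tb = tokens(a), tokens(b)
    if not ta and not tb:
        return 0.0
    return len(ta & tb) / len(ta | tb)


def retrieve(question, examples, k=2):
    """Return the k example pairs whose questions are most similar to the input."""
    ranked = sorted(examples, key=lambda ex: jaccard(question, ex[0]), reverse=True)
    return ranked[:k]


def build_prompt(question, examples, k=2):
    """Assemble a few-shot prompt from the retrieved pairs plus the new question."""
    parts = ["Translate the question into SQL for an OMOP-CDM database.", ""]
    for example_question, example_sql in retrieve(question, examples, k):
        parts += [f"Question: {example_question}", f"SQL: {example_sql}", ""]
    parts += [f"Question: {question}", "SQL:"]
    return "\n".join(parts)


if __name__ == "__main__":
    print(build_prompt("How many patients have type 2 diabetes mellitus?", EXAMPLES))
```

The resulting prompt would then be sent to a language model, and the generated SQL reviewed before execution, in line with the finding that the models are not yet accurate enough for unsupervised use.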
Key insights extracted from
by Angelo Zilet... at arxiv.org, 03-15-2024
https://arxiv.org/pdf/2403.09226.pdf
Deeper Inquiries