Core Concepts
While large language models (LLMs) show promise for generating Boolean queries in systematic reviews, current research suffers from reproducibility and generalizability issues, highlighting the need for more transparent and robust evaluation methods.
Summary
This research paper investigates the use of large language models (LLMs) for generating Boolean queries in systematic reviews, focusing on the reproducibility and generalizability of existing studies. The authors attempt to reproduce the findings of two recent publications by Wang et al. and Alaniz et al., which explored the use of ChatGPT for this task.
The authors created a pipeline to automatically generate Boolean queries for systematic reviews using various LLM APIs, including GPT models and open-source alternatives like Mistral and Zephyr. They tested these models on the CLEF TAR and Seed datasets, comparing their performance to the baselines reported in the original studies.
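Concretely, one step of such a pipeline amounts to prompting an LLM API with a review topic and collecting the returned query string. The sketch below illustrates this with the OpenAI Python client; the model name, prompt wording, and output handling are assumptions for illustration, not the authors' exact configuration:

```python
# Minimal sketch of one pipeline step: asking an LLM API to draft a
# Boolean query for a systematic review topic. Model name, prompt, and
# parsing are illustrative assumptions, not the paper's exact setup.
from openai import OpenAI

client = OpenAI()  # reads OPENAI_API_KEY from the environment

def generate_boolean_query(review_title: str, model: str = "gpt-3.5-turbo") -> str:
    """Ask the model to draft a single Boolean query for the given review topic."""
    response = client.chat.completions.create(
        model=model,
        temperature=0,  # lowers, but does not eliminate, output variability
        messages=[
            {"role": "system",
             "content": "You are an information specialist who writes "
                        "Boolean queries for medical systematic reviews."},
            {"role": "user",
             "content": "Write a single-line PubMed Boolean query for a "
                        f"systematic review titled: {review_title}"},
        ],
    )
    return response.choices[0].message.content.strip()

print(generate_boolean_query(
    "Exercise interventions for depression in older adults"))
```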
The results show that while some LLMs, particularly GPT-3.5 and GPT-4, achieved higher precision scores on the CLEF TAR dataset and better recall scores on the Seed dataset compared to the original studies, they were unable to fully reproduce the reported results. This discrepancy highlights potential issues with transparency and completeness in the original studies' methodologies.
Furthermore, the authors found significant variability in the quality and format of the generated queries, even when using fixed random seeds. This inconsistency raises concerns about the reliability and robustness of using LLMs for this task in real-world settings.
The paper concludes that while LLMs hold potential for automating Boolean query generation in systematic reviews, further research with a strong emphasis on reproducibility, transparency, and rigorous evaluation is crucial before these models can be reliably integrated into the systematic review process.
- Bibliographic Information: Staudinger, M., Kusa, W., Piroi, F., Lipani, A., & Hanbury, A. (2024). A Reproducibility and Generalizability Study of Large Language Models for Query Generation. In Proceedings of the 2024 Annual International ACM SIGIR Conference on Research and Development in Information Retrieval in the Asia Pacific Region (SIGIR-AP ’24), December 9–12, 2024, Tokyo, Japan. ACM, New York, NY, USA, 11 pages. https://doi.org/10.1145/3673791.3698432
- Research Objective: To investigate the reproducibility and generalizability of using LLMs for Boolean query generation in systematic reviews, based on the findings of Wang et al. (2023) and Alaniz et al. (2024).
- Methodology: The authors developed a pipeline to generate Boolean queries using various LLM APIs, including GPT models and open-source alternatives. They tested these models on the CLEF TAR and Seed datasets, evaluating their performance using precision, recall, F1-score, and F3-score (see the metric sketch after this list), and comparing the results to the baselines reported in the original studies.
- Key Findings: The authors were unable to fully reproduce the results of the previous studies, despite achieving higher precision on the CLEF TAR dataset and better recall on the Seed dataset with some LLMs. Significant variability in query quality and format was observed, even with fixed random seeds.
- Main Conclusions: While LLMs show promise for Boolean query generation, current research suffers from reproducibility and generalizability issues. More transparent and robust evaluation methods are needed to ensure the reliability of these models for systematic reviews.
- Significance: This study highlights the importance of rigorous evaluation and transparency in LLM research for information retrieval tasks, particularly in the context of systematic reviews where accuracy and reliability are paramount.
- Limitations and Future Research: The study was limited by the lack of detailed methodological information in the original papers. Future research should focus on developing standardized evaluation frameworks, exploring methods for improving the consistency of LLM-generated queries, and investigating the generalizability of these models across different review topics and domains.
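The F3-score mentioned in the methodology bullet is the F-beta measure with beta = 3, which weights recall nine times as heavily as precision; this fits systematic reviews, where missing a relevant study is costlier than retrieving extra ones. A minimal sketch of these standard set-based metrics (illustrative code, not the authors' evaluation scripts):

```python
# Standard set-based retrieval metrics; illustrative, not the authors'
# evaluation code.

def fbeta(precision: float, recall: float, beta: float) -> float:
    """F-beta score: beta > 1 weights recall more heavily than precision."""
    if precision == 0 and recall == 0:
        return 0.0
    b2 = beta * beta
    return (1 + b2) * precision * recall / (b2 * precision + recall)

def evaluate(retrieved: set[str], relevant: set[str]) -> dict[str, float]:
    hits = len(retrieved & relevant)
    precision = hits / len(retrieved) if retrieved else 0.0
    recall = hits / len(relevant) if relevant else 0.0
    return {
        "precision": precision,
        "recall": recall,
        "f1": fbeta(precision, recall, beta=1),
        "f3": fbeta(precision, recall, beta=3),  # recall-oriented
    }

print(evaluate(retrieved={"d1", "d2", "d3"},
               relevant={"d2", "d3", "d4", "d5"}))
```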
Statistics
As of July 5th, 2024, the paper by Wang et al. (2023) had been cited 152 times according to Google Scholar and 111 times according to Semantic Scholar within less than a year of its publication.
23% of published systematic reviews need to be updated within two years after completion (Shojania et al., 2007).
Conducting a complete systematic review takes, on average, 67 weeks (Grant & Booth, 2009).
The median time to publication for a systematic review was 2.4 years (Falconer et al., 2016).
Quotes
"Although OpenAI and Mistral AI have extended their APIs to allow the configuration of a random seed, even this beta function does not guarantee a deterministic output and the reproduction of generated outputs."
"The inherent LLM output variability poses a challenge to the reproducibility of systematic reviews, necessitating rigorous validation of LLM-generated queries against expert strategies to ensure reliability and relevance [58]."
"While models, such as Llama [49], Alpaca [47], and Mistral [19] are open-source, their performance in domain-specific tasks and low resource setting is heavily influenced by the original datasets used for their training."