Bibliographic Information: Takehi, R., Voorhees, E. M., & Sakai, T. (2024). LLM-Assisted Relevance Assessments: When Should We Ask LLMs for Help? arXiv preprint arXiv:2411.06877.
Research Objective: This paper investigates how to effectively leverage both human expertise and the efficiency of large language models (LLMs) to create robust and reliable test collections for evaluating information retrieval systems, especially under budget constraints.
Methodology: The researchers developed LARA, an algorithm that strategically combines manual annotations with LLM predictions. LARA identifies the documents for which human assessment is most informative, based on the uncertainty of the LLM's relevance predictions. It then uses these manual annotations to calibrate and refine the LLM's predictions for the remaining documents. LARA's performance is compared against a range of baselines, including methods that rely solely on manual assessments, methods that rely solely on LLM predictions, and other hybrid approaches.
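The selection-then-calibration loop can be illustrated with a short sketch. This is a minimal illustration rather than the authors' implementation: the names `lara_sketch` and `human_label_fn` are hypothetical, the LLM is assumed to expose a per-document relevance probability, and the simple threshold search stands in for whatever calibration model the paper actually uses.

```python
import numpy as np

def lara_sketch(llm_probs, human_label_fn, budget):
    """Hybrid labeling sketch (illustrative, not the paper's exact procedure).

    Spend `budget` human judgments on the documents whose LLM relevance
    probability is most uncertain (closest to 0.5), then use those labels
    to pick the decision threshold that best reproduces the human
    judgments, and apply it to the remaining LLM predictions.

    llm_probs      : array of P(relevant) from the LLM, one per document
    human_label_fn : callable(doc_index) -> 0/1 human judgment (hypothetical)
    budget         : number of documents a human may assess
    """
    llm_probs = np.asarray(llm_probs, dtype=float)

    # 1. Pick the `budget` most uncertain documents for manual review.
    uncertainty = np.abs(llm_probs - 0.5)
    manual_idx = np.argsort(uncertainty)[:budget]
    labels = {int(i): human_label_fn(int(i)) for i in manual_idx}

    # 2. Calibrate: choose the probability threshold that agrees best
    #    with the collected human labels.
    candidates = np.linspace(0.05, 0.95, 19)
    def agreement(t):
        return sum((llm_probs[i] >= t) == bool(y) for i, y in labels.items())
    threshold = max(candidates, key=agreement)

    # 3. Final qrels: human label where available, calibrated LLM label elsewhere.
    qrels = np.where(llm_probs >= threshold, 1, 0)
    for i, y in labels.items():
        qrels[i] = y
    return qrels, threshold

# Toy usage with synthetic probabilities and a stand-in "human" oracle.
rng = np.random.default_rng(0)
probs = rng.uniform(size=100)
qrels, t = lara_sketch(probs, lambda i: int(probs[i] > 0.6), budget=20)
```

The paper's calibration is more sophisticated than a single threshold search; the point of the sketch is only the division of labor: humans judge where the LLM is least certain, and those judgments steer how the remaining LLM predictions are converted into relevance labels.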
Key Findings: The experiments, conducted on TREC-COVID and TREC-8 Ad Hoc datasets, demonstrate that LARA consistently outperforms all other methods in accurately ranking information retrieval systems, particularly under limited annotation budgets. The study also found that LARA effectively minimizes errors in LLM annotations by strategically incorporating human judgments.
Main Conclusions: The research concludes that a hybrid approach like LARA offers a practical and effective solution for building high-quality test collections. By balancing the strengths of human assessors and LLMs, LARA allows for the creation of larger, more reliable test collections, ultimately leading to more robust evaluations of information retrieval systems.
Significance: This work contributes to the field of information retrieval evaluation by providing a practical method for building test collections, a resource central to evaluating and improving search engines and other retrieval systems.
Limitations and Future Research: While the study demonstrates the effectiveness of LARA for binary relevance judgments, future research could explore its applicability to graded relevance assessments. Additionally, investigating the adaptation of LARA to other annotation tasks in information retrieval and related fields like e-Discovery is a promising direction.
Source: Rikiya Takehi et al., arxiv.org, 11-12-2024, https://arxiv.org/pdf/2411.06877.pdf