Core Concepts
Large language models can generate relevance labels for search results that are at least as accurate as labels provided by human assessors, and can be used to efficiently scale relevance labeling for information retrieval system evaluation.
Abstract
The paper proposes an alternative way to obtain high-quality relevance labels for evaluating information retrieval systems. Traditionally, relevance labels come from human assessors, which is costly, time-consuming, and prone to biases and errors. The authors instead use large language models (LLMs) to generate relevance labels that match real searcher preferences.
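To make the approach concrete, here is a minimal sketch of what such a labeling call might look like. The prompt wording, the 0-2 grading scale, and the `call_llm` helper are illustrative assumptions, not the paper's exact prompt or API.

```python
# Minimal sketch of LLM-based relevance labeling.
# `call_llm` is a hypothetical placeholder for a completion API;
# the prompt wording is illustrative, not the paper's exact prompt.
import re

PROMPT_TEMPLATE = """You are a search quality rater.
Query: {query}
Intent: {narrative}
Passage: {passage}

On a scale of 0 (not relevant) to 2 (highly relevant), how well does the
passage answer the query? Reply with a single digit."""

def call_llm(prompt: str) -> str:
    """Placeholder: send `prompt` to an LLM and return its text reply."""
    raise NotImplementedError("wire this up to your model of choice")

def label_relevance(query: str, narrative: str, passage: str) -> int:
    """Ask the LLM for a graded relevance label and parse the digit it returns."""
    reply = call_llm(PROMPT_TEMPLATE.format(
        query=query, narrative=narrative, passage=passage))
    match = re.search(r"[0-2]", reply)
    if match is None:
        raise ValueError(f"could not parse a label from: {reply!r}")
    return int(match.group())
```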
The key highlights and insights are:
The authors conducted experiments on the TREC-Robust collection, comparing relevance labels generated by LLMs with those provided by the official TREC assessors. They found that, with careful prompt engineering, LLMs can agree with the official TREC judgments at least as well as other populations of human labelers do (a sketch of this agreement check appears after this list).
The authors also evaluated the impact of the LLM-generated labels on query and system ranking, and found that the rankings derived from LLM labels were highly consistent with those derived from human labels.
The authors argue that LLMs offer several advantages over human labelers, including higher accuracy, higher throughput, lower cost, and better scalability. They report a successful deployment of LLM-based relevance labeling at Bing, where the LLM labels outperformed both crowd workers and in-house experts.
The authors note that LLM performance is highly sensitive to prompt wording, and emphasize the importance of carefully selecting and validating the prompt against a high-quality ground truth dataset of real searcher preferences.
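The evaluation steps in the highlights above can be made concrete with a small, self-contained sketch: candidate prompt wordings are validated by their agreement (Cohen's kappa) with a gold set of human labels, and the system ranking induced by the winning prompt's labels is compared to the human-label ranking with Kendall's tau. All data, prompt names, and the simple mean-label system score below are illustrative assumptions; the paper's own evaluation uses the official TREC metrics.

```python
# Illustrative check of LLM relevance labels against human labels.
# All data below is made up; real use would load TREC qrels and run files.
from sklearn.metrics import cohen_kappa_score
from scipy.stats import kendalltau

# Gold (human) labels and labels from two candidate prompt wordings,
# keyed by (query_id, doc_id).
pairs = [("q1", "d1"), ("q1", "d2"), ("q2", "d1"), ("q2", "d3"), ("q3", "d2")]
human = {("q1", "d1"): 2, ("q1", "d2"): 0, ("q2", "d1"): 1,
         ("q2", "d3"): 2, ("q3", "d2"): 0}
llm_by_prompt = {
    "prompt_a": {("q1", "d1"): 2, ("q1", "d2"): 1, ("q2", "d1"): 1,
                 ("q2", "d3"): 2, ("q3", "d2"): 0},
    "prompt_b": {("q1", "d1"): 1, ("q1", "d2"): 0, ("q2", "d1"): 0,
                 ("q2", "d3"): 1, ("q3", "d2"): 0},
}

# 1) Validate each prompt by agreement with the gold labels (Cohen's kappa).
def kappa(labels):
    return cohen_kappa_score([human[p] for p in pairs], [labels[p] for p in pairs])

best_prompt = max(llm_by_prompt, key=lambda name: kappa(llm_by_prompt[name]))
print({name: round(kappa(lab), 3) for name, lab in llm_by_prompt.items()})

# 2) Check that system rankings are consistent under the two label sets.
# Each system is represented by the documents it retrieved per query; the
# score here is just the mean label of retrieved documents (illustrative,
# not an official TREC metric).
runs = {
    "system_1": [("q1", "d1"), ("q2", "d3"), ("q3", "d2")],
    "system_2": [("q1", "d2"), ("q2", "d1"), ("q3", "d2")],
}

def mean_score(labels, retrieved):
    return sum(labels.get(p, 0) for p in retrieved) / len(retrieved)

systems = sorted(runs)
human_scores = [mean_score(human, runs[s]) for s in systems]
llm_scores = [mean_score(llm_by_prompt[best_prompt], runs[s]) for s in systems]
tau, _ = kendalltau(human_scores, llm_scores)
print(f"best prompt: {best_prompt}, ranking correlation (Kendall's tau): {tau:.2f}")
```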
Stats
"LLMs are as accurate as human labellers and as useful for finding the best systems and hardest queries."
"LLM performance varies with prompt features, but also varies unpredictably with simple paraphrases."
Quotes
"LLMs can do better on this metric than any population of human labellers that we study."
"Our experiments show LLMs are as accurate as human labellers and as useful for finding the best systems and hardest queries."