Core Concepts
This paper proposes a novel query performance prediction (QPP) framework, called QPP-GenRE, which decomposes QPP into independent subtasks: automatically generating a relevance judgment for each item in a ranked list using large language models (LLMs). QPP-GenRE can then predict various IR evaluation measures from the generated relevance judgments, and the judgments themselves provide interpretable insight into the QPP outputs.
Abstract
The paper proposes a novel QPP framework called QPP-GenRE, which decomposes QPP into two main steps:
Generating relevance judgments using LLMs:
QPP-GenRE employs an LLM, specifically LLaMA, to automatically predict the relevance of each item in the top-n positions of a ranked list for a given query.
To improve the LLM's effectiveness in generating relevance judgments, the authors fine-tune LLaMA using parameter-efficient fine-tuning (PEFT) with human-labeled relevance judgments.
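The first step can be sketched as prompting an LLM per (query, item) pair and mapping its answer to a binary pseudo-label. This is a minimal illustration only: the prompt template, the `llm.generate` call, and the answer format are hypothetical stand-ins, not the paper's actual fine-tuned LLaMA setup.

```python
def build_judgment_prompt(query: str, passage: str) -> str:
    """Build an instruction prompt asking an LLM to judge relevance.

    Hypothetical template; the paper's exact prompt wording is not
    reproduced here.
    """
    return (
        "Judge whether the passage answers the query. "
        "Answer 'Relevant' or 'Irrelevant'.\n"
        f"Query: {query}\n"
        f"Passage: {passage}\n"
        "Answer:"
    )

def parse_judgment(llm_output: str) -> int:
    """Map the model's free-text answer to a binary pseudo-label."""
    return 1 if llm_output.strip().lower().startswith("relevant") else 0

prompt = build_judgment_prompt(
    "what is query performance prediction",
    "QPP estimates the retrieval quality of a ranked list for a query.",
)
# answer = llm.generate(prompt)  # e.g., a fine-tuned LLaMA; omitted here
print(parse_judgment("Relevant"))    # -> 1
print(parse_judgment("Irrelevant"))  # -> 0
```

Applying this to each of the top-n items yields the pseudo-labels used in the second step.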
Predicting IR evaluation measures:
QPP-GenRE treats the generated relevance judgments as pseudo-labels from which it calculates different IR evaluation measures, such as reciprocal rank at depth 10 (RR@10) and normalized discounted cumulative gain at depth 10 (nDCG@10).
For predicting recall-oriented measures such as nDCG@10, whose ideal ranking in principle requires relevance judgments over the entire corpus, the authors devise an approximation strategy that avoids the high computational cost of judging every item in the corpus.
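The second step can be sketched as computing standard measures directly from the binary pseudo-labels. This is a minimal sketch, not the paper's implementation: it assumes binary labels for the top-k, and it approximates the ideal DCG from the judged items only (treating unjudged corpus items as non-relevant), in the spirit of avoiding corpus-wide judging.

```python
import math

def rr_at_k(judgments: list[int], k: int = 10) -> float:
    """Reciprocal rank: 1/rank of the first relevant item in the top-k, else 0."""
    for rank, rel in enumerate(judgments[:k], start=1):
        if rel:
            return 1.0 / rank
    return 0.0

def ndcg_at_k(judgments: list[int], k: int = 10) -> float:
    """nDCG@k from binary pseudo-labels.

    Assumption: the ideal DCG is approximated from the judged items
    alone, i.e., unjudged corpus items are taken to be non-relevant.
    """
    dcg = sum(rel / math.log2(rank + 1)
              for rank, rel in enumerate(judgments[:k], start=1))
    ideal = sorted(judgments, reverse=True)[:k]
    idcg = sum(rel / math.log2(rank + 1)
               for rank, rel in enumerate(ideal, start=1))
    return dcg / idcg if idcg > 0 else 0.0

# Pseudo-labels for the top-10 of a ranked list (1 = relevant).
labels = [0, 1, 0, 0, 1, 0, 0, 0, 0, 0]
print(rr_at_k(labels))    # first relevant item at rank 2 -> 0.5
print(ndcg_at_k(labels))
```

Judging a deeper pool of the ranked list tightens the ideal-DCG approximation, which is why the judgment depth studied in the paper matters for QPP quality.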
Experiments on the TREC 2019-2022 deep learning tracks show that QPP-GenRE achieves state-of-the-art QPP performance in estimating the retrieval quality of both a lexical ranker (BM25) and a neural ranker (ANCE), for both precision- and recall-oriented IR evaluation measures. The authors also analyze the impact of the judgment depth in the ranked list on QPP quality, and demonstrate that fine-tuning LLaMA outperforms zero-/few-shot prompting.
Stats
The retrieval quality of BM25 in terms of nDCG@10 is 0.506, 0.480, 0.446 and 0.269 on TREC-DL 19, 20, 21 and 22, respectively.
The retrieval quality of ANCE in terms of nDCG@10 is 0.645 and 0.646 on TREC-DL 19 and 20, respectively.