Core Concepts
Permutation self-consistency, a novel decoding technique, can improve the quality, consistency, and position invariance of listwise ranking in black-box large language models.
Summary
The paper proposes a novel decoding technique called "permutation self-consistency" to improve the listwise ranking ability of large language models (LLMs). The key idea is to marginalize out the positional biases in LLMs by repeatedly shuffling the input list, passing it through the LLM, and then aggregating the resulting rankings.
The authors first demonstrate that LLMs exhibit positional biases, especially in the middle of long input lists, which can lead to poor ranking performance. To address this, they introduce permutation self-consistency, which has two main steps:
1. Construct a diverse set of output rankings by randomly permuting the input list and passing each permutation through the LLM.
2. Aggregate these output rankings into a single central ranking that minimizes the sum of Kendall tau distances to all the individual rankings, effectively marginalizing out the positional biases.
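The two steps above can be sketched in Python. This is a minimal illustration, not the authors' implementation: `rank_fn` is a hypothetical stand-in for the LLM ranking call, and the brute-force Kemeny aggregation shown here is only feasible for short lists.

```python
import itertools
import random

def kendall_tau_distance(r1, r2):
    """Number of item pairs that the two rankings order differently."""
    pos2 = {item: i for i, item in enumerate(r2)}
    dist = 0
    for a, b in itertools.combinations(r1, 2):
        # r1 places a before b; count a disagreement if r2 reverses them
        if pos2[a] > pos2[b]:
            dist += 1
    return dist

def kemeny_aggregate(rankings):
    """Brute-force Kemeny-Young aggregation: return the permutation that
    minimizes the total Kendall tau distance to all observed rankings."""
    items = rankings[0]
    best, best_cost = None, float("inf")
    for candidate in itertools.permutations(items):
        cost = sum(kendall_tau_distance(candidate, r) for r in rankings)
        if cost < best_cost:
            best, best_cost = candidate, cost
    return list(best)

def permutation_self_consistency(items, rank_fn, num_samples=8, seed=0):
    """Step 1: shuffle the input list and rank each shuffle with rank_fn
    (a stand-in for the LLM). Step 2: aggregate the sampled rankings."""
    rng = random.Random(seed)
    rankings = []
    for _ in range(num_samples):
        shuffled = items[:]
        rng.shuffle(shuffled)
        rankings.append(rank_fn(shuffled))
    return kemeny_aggregate(rankings)
```

With a perfect ranker (e.g. `rank_fn=sorted`), every sampled ranking agrees and the aggregate is that ranking; with a positionally biased ranker, the shuffling spreads the bias across different items so that the pairwise majority, and hence the Kemeny aggregate, tends toward the true order.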
The authors provide theoretical guarantees, showing that the Kemeny-Young optimal ranking used in the aggregation step can recover the true ranking under certain noise distributions.
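In standard rank-aggregation notation (not quoted from the paper itself), the aggregation step solves the Kemeny-Young problem:

```latex
\hat{\sigma} = \operatorname*{arg\,min}_{\sigma \in S_n} \sum_{i=1}^{m} d_{\mathrm{KT}}(\sigma, \sigma_i)
```

where $\sigma_1, \dots, \sigma_m$ are the rankings obtained from the $m$ shuffled prompts, $S_n$ is the set of permutations of the $n$ items, and $d_{\mathrm{KT}}$ is the Kendall tau distance, i.e., the number of item pairs the two rankings order differently.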
Empirically, the authors evaluate permutation self-consistency on three sorting tasks (math expressions, words, and sentences) and two passage reranking datasets. Compared to conventional inference, they consistently observe improvements of 34-52% for the Mistral model, 7-18% for GPT-3.5, and 8-16% for LLaMA v2 (70B). The authors also conduct analyses to justify their design choices, such as the number of aggregated rankings and the use of Kemeny ranking over alternative aggregation methods.
Overall, the paper introduces a novel and effective technique for improving the listwise ranking capabilities of black-box LLMs, with potential applications in various domains that require high-quality ranking, such as information retrieval, recommendation systems, and question answering.
Statistics
The correct output order for the example prompt on shrews is (2, 3, 1), from most to least relevant.
LLMs tend to get "lost in the middle" of long input lists and use the middle portion poorly, leading to misranking.
Prompt order can also affect the quality of LLM outputs, with some orders outperforming others.
Quotes
"Large language models (LLMs) exhibit positional bias in how they use context, which especially affects listwise ranking."
"Our key idea is to marginalize out different list orders in the prompt to produce an order-independent ranking with less positional bias."
"Theoretically, we prove the robustness of our method, showing convergence to the true ranking under random perturbations."