This paper revisits the indexing-time implementation of Probabilistic Structured Queries (PSQ) and analyzes the tradeoff between retrieval effectiveness and efficiency. The key points are:
The authors implement an efficient indexing-time PSQ in Python and evaluate it on modern, large cross-language information retrieval (CLIR) test collections, using more parallel text than previous work.
The authors revisit the three pruning techniques studied by prior work: Probability Mass Function (PMF) threshold, Cumulative Distribution Function (CDF) threshold, and Top-k filtering. They find that CDF thresholds provide a suboptimal effectiveness-efficiency tradeoff, while PMF thresholds and Top-k filtering offer Pareto-optimal operating points.
The authors conduct a Pareto analysis to study the tradeoff between retrieval effectiveness (measured by R@100 and MAP) and index size (used as a proxy for efficiency). They find that combining PMF thresholds with Top-k filtering is sufficient to reach Pareto-optimal effectiveness-efficiency tradeoffs, so CDF thresholds are not needed.
The authors make their Python PSQ implementation and the unpruned translation tables publicly available to facilitate further research.
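The three pruning techniques named in the key points can be illustrated with a minimal sketch. It is not the authors' released implementation: the function names, the toy translation distribution, the renormalization after pruning, and the exact CDF convention (keep terms until the accumulated mass reaches the threshold) are all assumptions made for illustration.

```python
from typing import Dict

# Hypothetical type: one source term's translation distribution,
# mapping each candidate translation to P(translation | source term).
TranslationDist = Dict[str, float]


def _renormalize(dist: TranslationDist) -> TranslationDist:
    """Rescale surviving probabilities to sum to 1 (an assumption of this sketch)."""
    total = sum(dist.values())
    return {t: p / total for t, p in dist.items()} if total > 0 else {}


def prune_pmf(dist: TranslationDist, threshold: float) -> TranslationDist:
    """PMF threshold: keep translations whose own probability is >= threshold."""
    return _renormalize({t: p for t, p in dist.items() if p >= threshold})


def prune_cdf(dist: TranslationDist, threshold: float) -> TranslationDist:
    """CDF threshold: keep the most probable translations until their
    cumulative probability reaches the threshold."""
    kept: TranslationDist = {}
    cumulative = 0.0
    for term, prob in sorted(dist.items(), key=lambda kv: kv[1], reverse=True):
        if cumulative >= threshold:
            break
        kept[term] = prob
        cumulative += prob
    return _renormalize(kept)


def prune_top_k(dist: TranslationDist, k: int) -> TranslationDist:
    """Top-k filtering: keep only the k most probable translations."""
    top = sorted(dist.items(), key=lambda kv: kv[1], reverse=True)[:k]
    return _renormalize(dict(top))


# Toy distribution with made-up probabilities, for illustration only.
example = {"house": 0.6, "home": 0.25, "building": 0.1, "household": 0.05}
pmf_pruned = prune_pmf(example, 0.1)    # drops "household"
cdf_pruned = prune_cdf(example, 0.8)    # keeps "house" and "home"
top1_pruned = prune_top_k(example, 1)   # keeps "house" only
```

Each function shrinks the translation table, which in turn shrinks the index. A PMF threshold and Top-k can be chained by applying one function to the output of the other.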
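The Pareto analysis mentioned above can be sketched in the same spirit. The sketch assumes each pruning configuration is summarized as a hypothetical (index size, effectiveness) pair, where smaller size and higher effectiveness are both preferred. The function name and the numbers are invented for illustration and are not results from the paper.

```python
from typing import List, Tuple

# Hypothetical representation: (index_size, effectiveness) per configuration.
Point = Tuple[float, float]


def pareto_front(points: List[Point]) -> List[Point]:
    """Return the non-dominated points, sorted by index size.

    A point is dominated if some other point has an index size that is no
    larger and an effectiveness that is no lower, with at least one of the
    two strictly better.
    """
    front: List[Point] = []
    best_effectiveness = float("-inf")
    # Visit points from smallest to largest index; among equal sizes,
    # visit the most effective first so the weaker ties are rejected.
    for size, eff in sorted(set(points), key=lambda p: (p[0], -p[1])):
        if eff > best_effectiveness:
            front.append((size, eff))
            best_effectiveness = eff
    return front


# Made-up operating points, for illustration only.
configs = [(1.0, 0.30), (2.0, 0.28), (2.5, 0.35), (4.0, 0.35), (5.0, 0.40)]
front = pareto_front(configs)
# (2.0, 0.28) and (4.0, 0.35) are dominated and are left out of the front.
```

A pruning technique is suboptimal in this sense when none of its configurations appear on the front computed over all techniques combined.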
Key Insights Distilled From
by Eugene Yang,... at arxiv.org 04-30-2024
https://arxiv.org/pdf/2404.18797.pdf