Staudinger, M., Piroi, F., & Rauber, A. (2024). Reproducible Hybrid Time-Travel Retrieval in Evolving Corpora. In Proceedings of the 2024 Annual International ACM SIGIR Conference on Research and Development in Information Retrieval in the Asia Pacific Region (SIGIR-AP ’24), December 9–12, 2024, Tokyo, Japan. ACM, New York, NY, USA, 6 pages. https://doi.org/10.1145/3673791.3698421
This paper presents a novel hybrid information retrieval (IR) system designed to address the challenge of result reproducibility in evolving document collections, a critical issue in research areas reliant on consistent data subsets for analysis and validation. The authors aim to demonstrate the feasibility of combining a traditional IR system (Lucene) with a versioned column-store database (MonetDB) to achieve both efficient query processing and result reproducibility.
The proposed hybrid system uses Lucene for fast, ranked retrieval and a VCBR system (the versioned column-store component, implemented on MonetDB) for storing historical corpus statistics, which makes it possible to recreate past index states. Document preprocessing and term statistics are kept synchronized between Lucene and MonetDB, and changes are tracked over time. Queries are primarily handled by Lucene; each result is saved in a query store together with metadata and hash keys so that reproducibility can be verified later. When a query is re-executed, the VCBR system retrieves the corpus statistics that were valid at the original execution time and reproduces the identical ranked list.
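The idea of re-executing a query against historical corpus statistics can be sketched in a few lines of Python. This is a minimal illustration, not the authors' implementation: the `Posting` record, the `valid_from`/`valid_to` version columns, and the `bm25_at` function are hypothetical names, and the scoring is a textbook BM25 variant rather than Lucene's exact formula. The sketch assumes one posting per (document, term) pair and integer corpus versions.

```python
import math
from dataclasses import dataclass


@dataclass(frozen=True)
class Posting:
    """One (document, term) entry with a validity interval (hypothetical schema)."""
    doc_id: str
    term: str
    tf: int                      # term frequency in the document
    doc_len: int                 # document length in tokens
    valid_from: int              # corpus version in which the entry appeared
    valid_to: float = math.inf   # corpus version in which it was removed


def bm25_at(postings, query_terms, version, k1=1.2, b=0.75):
    """Rank documents using only the statistics that were valid at `version`."""
    live = [p for p in postings if p.valid_from <= version < p.valid_to]
    doc_lens = {p.doc_id: p.doc_len for p in live}
    n_docs = len(doc_lens)
    if n_docs == 0:
        return []
    avg_len = sum(doc_lens.values()) / n_docs

    scores = {}
    for term in query_terms:
        matches = [p for p in live if p.term == term]
        df = len(matches)  # document frequency as of `version`
        if df == 0:
            continue
        idf = math.log(1 + (n_docs - df + 0.5) / (df + 0.5))
        for p in matches:
            norm = p.tf + k1 * (1 - b + b * p.doc_len / avg_len)
            scores[p.doc_id] = scores.get(p.doc_id, 0.0) + idf * p.tf * (k1 + 1) / norm
    # Sort by descending score; break ties by doc_id so the order is deterministic.
    return sorted(scores.items(), key=lambda kv: (-kv[1], kv[0]))


# Toy corpus: d2 is deleted in version 3, d3 is added in version 2.
postings = [
    Posting("d1", "lucene", 2, 4, valid_from=1),
    Posting("d1", "index", 1, 4, valid_from=1),
    Posting("d2", "lucene", 1, 3, valid_from=1, valid_to=3),
    Posting("d2", "monetdb", 2, 3, valid_from=1, valid_to=3),
    Posting("d3", "lucene", 3, 5, valid_from=2),
]

ranking_v1 = [doc for doc, _ in bm25_at(postings, ["lucene"], version=1)]
ranking_v3 = [doc for doc, _ in bm25_at(postings, ["lucene"], version=3)]
```

Because the filter on `valid_from` and `valid_to` restores the document count, document frequencies, and average length of the requested version, the same query yields the version-1 ranking even after the corpus has changed. A column store suits this access pattern because the filter touches only the version columns.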
Evaluation on a subset of the German Wikipedia corpus showed that the system reproduces identical ranked lists while maintaining acceptable performance. Lucene's indexing and query processing times remained relatively stable as the corpus grew, whereas MonetDB's increased linearly. Re-executed scores varied slightly because of floating-point inaccuracies, but the rankings themselves matched, and a hash-based error correction mechanism handled the rare discrepancies.
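One way to make such a hash check robust to floating-point noise is to hash only the order of document identifiers and leave the scores out. The summary does not describe the authors' hashing scheme, so the following is an assumption-laden sketch: `ranking_hash` and `matches_stored` are hypothetical helpers, the ranked list is assumed to be a sequence of `(doc_id, score)` pairs, and document identifiers are assumed not to contain newlines. The sketch only detects a mismatch; it does not correct one.

```python
import hashlib


def ranking_hash(ranked):
    """Hash the ordered document IDs of a ranked list, ignoring the scores."""
    payload = "\n".join(doc_id for doc_id, _score in ranked)
    return hashlib.sha256(payload.encode("utf-8")).hexdigest()


def matches_stored(stored_hash, rerun):
    """Return True if a re-executed ranking has the same order as the stored one."""
    return ranking_hash(rerun) == stored_hash


# Scores differ in the last digits, but the order is the same.
original = [("d1", 1.3220001), ("d2", 1.0620000)]
rerun = [("d1", 1.3219999), ("d2", 1.0620000)]
swapped = [("d2", 1.0620000), ("d1", 1.3219999)]

stored = ranking_hash(original)
same_order = matches_stored(stored, rerun)
different_order = matches_stored(stored, swapped)
```

A hash over IDs alone treats two runs as equal whenever the order agrees, which matches the reported behavior of identical rankings despite small score differences. If score drift ever flips the order of two near-tied documents, the hash changes and the discrepancy becomes visible.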
The hybrid Lucene-VCBR system effectively addresses the reproducibility challenge in evolving corpora, offering a viable solution for research domains requiring result traceability and replicability. The system's ability to recreate past index states enables time-travel search, further enhancing its utility.
This research contributes a practical solution for reproducible ranked retrieval in dynamic data environments. The proposed system could benefit research areas such as systematic literature reviews, patent analysis, and scientific studies that rely on evolving data collections.
Future work includes expanding the system's support for additional retrieval models beyond BM25, particularly dense retrieval models. Performance optimization of the VCBR system is crucial for handling larger datasets. Further investigation into the reproducibility of ranked lists using neural reranking techniques is also warranted.
Key insights distilled from: Moritz Staud... at arxiv.org, 11-07-2024, https://arxiv.org/pdf/2411.04051.pdf