MessIRve is a new large-scale dataset for Spanish information retrieval (IR), built to address the lack of comprehensive Spanish IR benchmarks. Its queries come from Google's autocomplete API and reflect the information needs of Spanish speakers in 20 countries, alongside queries not specific to any country. The relevant documents are Wikipedia paragraphs identified as featured snippets in Google Search results.
The dataset is partitioned into training and test sets, with the test set designed to have minimal topical overlap with the training set. A quality assessment showed that the queries are generally clear and unambiguous, and that the annotated relevant documents are likely to contain information that helps answer them.
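To make the idea of a topic-disjoint partition concrete, here is a minimal sketch. It is not the authors' procedure: it assumes each query already carries a topic label (a hypothetical field), and the function name `topic_disjoint_split`, the `test_fraction` parameter, and the toy queries are all illustrative.

```python
import random
from collections import defaultdict


def topic_disjoint_split(examples, test_fraction=0.2, seed=0):
    """Split (query, topic) pairs so that no topic appears in both sets.

    `examples` is a list of (query, topic) tuples. The topic label is an
    assumption made for this sketch, not a field documented by the source.
    """
    # Group queries by their topic so whole topics move together.
    by_topic = defaultdict(list)
    for query, topic in examples:
        by_topic[topic].append(query)

    # Shuffle topics deterministically, then reserve a fraction for test.
    topics = sorted(by_topic)
    random.Random(seed).shuffle(topics)
    n_test = max(1, round(len(topics) * test_fraction))
    test_topics = set(topics[:n_test])

    train = [(q, t) for t in topics if t not in test_topics for q in by_topic[t]]
    test = [(q, t) for t in topics if t in test_topics for q in by_topic[t]]
    return train, test


# Toy data: five hypothetical topics, two queries each.
examples = [
    ("capital de argentina", "geografia"),
    ("rio mas largo de españa", "geografia"),
    ("quien escribio el quijote", "literatura"),
    ("obras de borges", "literatura"),
    ("que es un agujero negro", "astronomia"),
    ("planetas del sistema solar", "astronomia"),
    ("reglas del futbol", "deporte"),
    ("campeones del mundial", "deporte"),
    ("receta de empanadas", "cocina"),
    ("como hacer tortillas", "cocina"),
]
train, test = topic_disjoint_split(examples, test_fraction=0.2, seed=0)
```

Splitting by topic rather than by individual query is what keeps the test set from rewarding memorized training topics; the cost is that split sizes are only approximately controlled.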
Compared to existing Spanish IR datasets, MessIRve is substantially larger and covers a wider variety of topics. It also explicitly accounts for the diverse dialects of Spanish spoken across different countries, whereas other datasets either do not consider dialectal variation or do not clearly document whether it is included.
Baseline evaluations of prominent IR models, including BM25, MIRACL-mdpr-es, E5-large, and OpenAI-large, show that the larger dense retrieval models outperform the smaller and lexical models. However, performance varies across the country-specific subsets of the dataset, highlighting the need for further research on IR in diverse linguistic contexts.
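A per-country comparison of this kind can be sketched as follows. This is an illustration under stated assumptions, not the paper's evaluation code: recall@k is used here only as an example metric, and the function names, country codes, and document IDs are hypothetical.

```python
from collections import defaultdict


def recall_at_k(ranked_ids, relevant_ids, k):
    """Fraction of the relevant documents found in the top-k ranked results."""
    if not relevant_ids:
        return 0.0
    hits = len(set(ranked_ids[:k]) & set(relevant_ids))
    return hits / len(relevant_ids)


def mean_recall_by_country(runs, k=10):
    """Average recall@k per country subset.

    `runs` is an iterable of (country, ranked_ids, relevant_ids) tuples,
    one per query. The tuple layout is an assumption for this sketch.
    """
    totals = defaultdict(float)
    counts = defaultdict(int)
    for country, ranked_ids, relevant_ids in runs:
        totals[country] += recall_at_k(ranked_ids, relevant_ids, k)
        counts[country] += 1
    return {country: totals[country] / counts[country] for country in totals}


# Toy retrieval results for three queries from two hypothetical subsets.
runs = [
    ("ar", ["d1", "d2", "d3"], ["d2"]),  # relevant doc retrieved at rank 2
    ("ar", ["d4", "d5"], ["d9"]),        # relevant doc not retrieved
    ("mx", ["d7", "d8"], ["d7"]),        # relevant doc retrieved at rank 1
]
scores = mean_recall_by_country(runs, k=2)
```

Reporting the metric per subset, rather than one pooled average, is what exposes the cross-country variation described above.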
Key insights extracted from
by Fran... at arxiv.org 09-11-2024
https://arxiv.org/pdf/2409.05994.pdf