Core Concepts
BIRCO is a benchmark for evaluating information retrieval (IR) systems on tasks with complex, multi-faceted user objectives that go beyond simple semantic similarity.
Abstract
The BIRCO benchmark consists of 5 datasets from diverse domains, including computer science, debate, literature, and biomedicine. Each dataset contains paragraph-length queries with multiple facets or requirements that relevant passages must satisfy.
The key highlights of the BIRCO benchmark are:
Complex Query Objectives: The queries in BIRCO require retrieving passages that satisfy multiple criteria at once, going beyond simple semantic similarity. For example, a query may ask for papers that refute a specific scientific claim using a particular set of measurements in a particular population.
Diverse Domains: The 5 datasets in BIRCO cover a range of domains, including AI/ML, debate, literature, and biomedicine, testing the generalization capabilities of retrieval models.
Decontamination: The authors carefully filtered out queries that could be answered by large language models (LLMs) without accessing the candidate passages, to ensure the benchmark evaluates the models' true retrieval capabilities.
Compact Size: Each query in BIRCO has a candidate pool of only 50-100 passages, making it feasible to evaluate computationally expensive LLM-based retrieval systems.
Challenging Baselines: Experiments show that even state-of-the-art retrieval models, including fine-tuned language models and LLMs, struggle to achieve satisfactory performance on BIRCO, suggesting the need for stronger models and new retrieval protocols to address complex user needs.
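To make the compact-pool setup concrete, here is a minimal sketch of the kind of per-query evaluation that 50-100 passage pools keep cheap: score every candidate against the query, rank by score, and compute nDCG@10. This is not BIRCO's official harness; the token-overlap scorer and the toy query/pool data are hypothetical stand-ins (in practice the scorer would be a neural retriever or an LLM-based reranker).

```python
import math

def dcg(relevances):
    # Discounted cumulative gain over a ranked list of relevance grades.
    return sum(rel / math.log2(rank + 2) for rank, rel in enumerate(relevances))

def ndcg_at_k(ranked_ids, relevance, k=10):
    # nDCG@k: DCG of the produced ranking divided by DCG of the ideal ranking.
    gains = [relevance.get(doc_id, 0) for doc_id in ranked_ids[:k]]
    ideal = sorted(relevance.values(), reverse=True)[:k]
    ideal_dcg = dcg(ideal)
    return dcg(gains) / ideal_dcg if ideal_dcg > 0 else 0.0

def evaluate_query(score_fn, query, pool, relevance, k=10):
    # Score every passage in the compact candidate pool and rank by score.
    ranked = sorted(pool, key=lambda d: score_fn(query, pool[d]), reverse=True)
    return ndcg_at_k(ranked, relevance, k)

def overlap_score(query, passage):
    # Hypothetical toy scorer: shared-token count between query and passage.
    q, p = set(query.lower().split()), set(passage.lower().split())
    return len(q & p)

# Toy candidate pool with graded relevance labels for one multi-faceted query.
pool = {
    "d1": "trial measuring blood pressure in elderly patients",
    "d2": "a novel about a lighthouse keeper",
    "d3": "refuting the claim with blood pressure measurements in elderly patients",
}
relevance = {"d3": 2, "d1": 1}
query = "papers refuting the claim using blood pressure measurements in elderly patients"
print(round(evaluate_query(overlap_score, query, pool, relevance), 3))  # → 1.0
```

Because each pool is small, even a scorer that makes one expensive model call per query-passage pair stays affordable, which is the point the "Compact Size" highlight makes about evaluating LLM-based retrieval systems.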
Stats
"We curate 5 open-source datasets (DORIS-MAE, ArguAna, WhatsThatBook, Clinical-Trial, and RELIC), which contain paragraph-length queries with multi-faceted task objectives."
"BIRCO also remains challenging for LLMs despite having only 50-100 documents per query, making it low-cost to evaluate LLM performance."
Quotes
"BIRCO evaluates the ability of IR systems to retrieve documents given multi-faceted user objectives."
"No approach achieves satisfactory performance on all benchmark tasks, suggesting that stronger models and new retrieval protocols are necessary to address complex user needs."