
ELITR-Bench: A Benchmark for Evaluating Long-Context Language Models on Meeting Assistant Tasks


Core Concepts
ELITR-Bench is a new benchmark for evaluating long-context language models on a practical meeting assistant scenario, featuring transcripts obtained by automatic speech recognition and a set of manually crafted questions.
Abstract
The paper introduces ELITR-Bench, a new benchmark for evaluating long-context language models on a meeting assistant task. The benchmark is built on meeting transcripts from the ELITR project, which contain long, noisy, oral-language data produced by automatic speech recognition. The authors augmented these transcripts with 271 manually crafted questions and their ground-truth answers, and ran extensive experiments on ELITR-Bench with 9 recent long-context language models, both proprietary and open-source. The results highlight a gap between proprietary models such as GPT-4 and open-source models based on LLaMA-2, especially when questions are asked sequentially within a conversation. The paper also provides a thorough analysis of the GPT-4-based evaluation methodology, including a crowdsourcing study comparing the LLM-based evaluation with human judgments. The findings suggest that while GPT-4's evaluation scores correlate highly with those of human judges, its ability to differentiate among more than three score levels may be limited.
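The correlation between the judge model's scores and human judgments mentioned above can be measured with a rank correlation. The sketch below implements Spearman's rho from scratch (with tie-aware ranking); the score lists are illustrative examples, not data from the paper:

```python
def spearman_rho(xs, ys):
    """Spearman rank correlation: Pearson correlation computed on ranks."""
    def ranks(vals):
        # Assign average ranks to tied values (1-based ranks).
        order = sorted(range(len(vals)), key=lambda i: vals[i])
        r = [0.0] * len(vals)
        i = 0
        while i < len(order):
            j = i
            while j + 1 < len(order) and vals[order[j + 1]] == vals[order[i]]:
                j += 1
            avg = (i + j) / 2 + 1  # average 1-based rank of the tie group
            for k in range(i, j + 1):
                r[order[k]] = avg
            i = j + 1
        return r

    rx, ry = ranks(xs), ranks(ys)
    n = len(xs)
    mx, my = sum(rx) / n, sum(ry) / n
    cov = sum((a - mx) * (b - my) for a, b in zip(rx, ry))
    sx = sum((a - mx) ** 2 for a in rx) ** 0.5
    sy = sum((b - my) ** 2 for b in ry) ** 0.5
    return cov / (sx * sy)

# Illustrative 1-5 scores (not taken from the paper).
gpt4_scores = [5, 4, 4, 2, 1, 3, 5, 2]
human_scores = [5, 5, 3, 2, 1, 3, 4, 2]
print(round(spearman_rho(gpt4_scores, human_scores), 3))
```

In practice one would also report a p-value and, for multiple human annotators, inter-annotator agreement alongside the model-human correlation.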
Stats
The average number of tokens per meeting transcript is 11,339 for the dev set and 12,562 for the test set.
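Per-transcript token counts like those quoted above depend on the tokenizer used. As a rough stand-in, the sketch below counts words and punctuation marks as separate tokens; an actual benchmark measurement would use the target model's own tokenizer (e.g. tiktoken for GPT-4), and the transcript line is a made-up example:

```python
import re

def approx_token_count(text: str) -> int:
    """Rough token estimate: each word and each punctuation mark
    counts as one token. A real count would use the model's tokenizer."""
    return len(re.findall(r"\w+|[^\w\s]", text))

# Hypothetical ASR-style transcript snippet, not from the ELITR data.
transcript = "PERSON1: So, shall we start with the ASR module? PERSON2: Yes, go ahead."
print(approx_token_count(transcript))
```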

Key Insights Distilled From

by Thibaut Thon... at arxiv.org 04-01-2024

https://arxiv.org/pdf/2403.20262.pdf
ELITR-Bench

Deeper Inquiries

How could the benchmark be extended to other real-world scenarios beyond meeting assistants?

The ELITR-Bench benchmark could be extended to other real-world scenarios by incorporating datasets and questions from various domains such as customer service interactions, educational settings, healthcare consultations, legal proceedings, and more. Each scenario would present unique challenges for long-context language models to overcome, allowing for a comprehensive evaluation of their capabilities across different applications. Additionally, the benchmark could include a wider range of languages to assess the models' performance in multilingual contexts.

What are the potential biases or limitations of using proprietary language models like GPT-4 as evaluators, and how could they be addressed?

Using proprietary language models like GPT-4 as evaluators may introduce biases stemming from the model's training data, architecture, and decision-making processes. In particular, a judge model may exhibit self-preference, scoring outputs from its own model family more favorably than those of competitors. To address these biases, it is essential to compare the judgments of proprietary evaluators against open-source evaluators and human annotators on diverse datasets and tasks. Transparency in the evaluation process, including detailed score rubrics and explicit criteria, also helps mitigate biases and ensures a fair assessment of all models.
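A transparent rubric-based evaluation of the kind described above can be made concrete as a judge prompt plus a deterministic score parser. The rubric and helper names below are hypothetical illustrations, not the paper's actual evaluation prompt:

```python
import re

# Hypothetical 1-5 rubric; the paper's actual judge prompt may differ.
RUBRIC = (
    "Rate the candidate answer against the ground truth on a 1-5 scale:\n"
    "5 = fully correct, 3 = partially correct, 1 = incorrect.\n"
    "Reply with 'Score: <n>' only."
)

def build_judge_prompt(question: str, reference: str, candidate: str) -> str:
    """Assemble the full prompt sent to the judge model."""
    return (
        f"{RUBRIC}\n\n"
        f"Question: {question}\n"
        f"Ground-truth answer: {reference}\n"
        f"Candidate answer: {candidate}"
    )

def parse_score(judge_reply: str):
    """Extract the integer score from the judge model's reply, or None."""
    m = re.search(r"Score:\s*([1-5])", judge_reply)
    return int(m.group(1)) if m else None

print(parse_score("Score: 4"))  # prints 4
```

Publishing the rubric and the parsing logic alongside the scores lets others audit and reproduce the evaluation, which is one practical way to address the opacity of proprietary judges.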

How might the performance of long-context language models on ELITR-Bench relate to their ability to handle other types of long, noisy, and conversational data, such as transcripts of interviews or lectures?

The performance of long-context language models on ELITR-Bench provides insights into their ability to handle other types of long, noisy, and conversational data, such as transcripts of interviews or lectures. Models that excel on ELITR-Bench demonstrate proficiency in processing extended contexts, understanding conversational nuances, and generating accurate responses to complex questions. This indicates their potential to perform well on similar tasks involving lengthy and noisy textual data, where contextual understanding and coherence are crucial. However, the specific characteristics of different datasets may require fine-tuning or adjustments to optimize the models' performance for each scenario.