Core Concepts
ELITR-Bench is a new benchmark for evaluating long-context language models on a practical meeting assistant scenario, featuring transcripts obtained by automatic speech recognition and a set of manually crafted questions.
Abstract
The paper introduces ELITR-Bench, a new benchmark for evaluating long-context language models on a meeting assistant task. The benchmark is built on meeting transcripts from the ELITR project: long, noisy records of spoken language obtained through automatic speech recognition. The authors augmented these transcripts with 271 manually crafted questions and their ground-truth answers.
The authors conducted extensive experiments on ELITR-Bench with 9 recent long-context language models, covering both proprietary and open-source models. The results reveal a performance gap between proprietary models such as GPT-4 and open-source models based on LLaMA-2, especially when questions are asked sequentially within a conversation rather than in isolation (see the sketch below).
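To make the two query settings concrete, the following is a minimal sketch of asking questions either independently (one fresh prompt per question) or sequentially within a single conversation. The function names, message layout, and the injected `generate` callable are illustrative assumptions, not the authors' implementation.

```python
# Illustrative sketch of the two query modes (names and message layout are
# assumptions, not the paper's actual code).
from typing import Callable, List

def ask_independently(transcript: str, questions: List[str],
                      generate: Callable[[list], str]) -> List[str]:
    """Each question is asked in a fresh prompt containing only the transcript."""
    answers = []
    for q in questions:
        messages = [{"role": "user", "content": f"{transcript}\n\nQuestion: {q}"}]
        answers.append(generate(messages))
    return answers

def ask_conversationally(transcript: str, questions: List[str],
                         generate: Callable[[list], str]) -> List[str]:
    """Questions are asked in sequence, with earlier turns kept in the context."""
    messages = [{"role": "user", "content": transcript}]
    answers = []
    for q in questions:
        messages.append({"role": "user", "content": q})
        answer = generate(messages)
        messages.append({"role": "assistant", "content": answer})
        answers.append(answer)
    return answers
```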
The paper also provides a thorough analysis of the authors' GPT-4-based evaluation methodology, including a crowdsourcing study comparing the LLM-based evaluation with human judgments. The findings suggest that while GPT-4's evaluation scores correlate strongly with those of human judges, its ability to differentiate among more than three score levels may be limited.
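As a rough illustration of what GPT-4-based answer grading looks like, here is a minimal LLM-as-judge sketch. The prompt wording, the 0 to 10 scale, and the model string are hypothetical placeholders; the paper's actual judging prompt and scoring setup may differ.

```python
# Minimal LLM-as-judge sketch (illustrative only; the paper's actual prompt,
# model version, and scoring scale may differ).
from openai import OpenAI

client = OpenAI()  # assumes OPENAI_API_KEY is set in the environment

JUDGE_PROMPT = """You are grading an assistant's answer to a question about a meeting.
Question: {question}
Ground-truth answer: {reference}
Assistant's answer: {candidate}
Give an integer score from 0 (wrong) to 10 (fully correct). Reply with the score only."""

def judge_answer(question: str, reference: str, candidate: str) -> int:
    """Ask a GPT-4 judge to score a candidate answer against the ground truth."""
    response = client.chat.completions.create(
        model="gpt-4",
        temperature=0,
        messages=[{"role": "user", "content": JUDGE_PROMPT.format(
            question=question, reference=reference, candidate=candidate)}],
    )
    return int(response.choices[0].message.content.strip())
```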
Stats
The average number of tokens per meeting transcript is 11,339 for the dev set and 12,562 for the test set.
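For illustration, statistics like these can be computed by tokenizing each transcript and averaging the counts. The snippet below uses tiktoken's cl100k_base encoding and a flat directory of .txt files; both the tokenizer choice and the file layout are assumptions, not details reported by the paper.

```python
# Sketch of computing the average token count per transcript
# (tokenizer and directory layout are assumptions for illustration).
from pathlib import Path
import tiktoken

enc = tiktoken.get_encoding("cl100k_base")

def average_tokens(transcript_dir: str) -> float:
    """Average token count over all .txt transcripts in a directory."""
    counts = [len(enc.encode(p.read_text())) for p in Path(transcript_dir).glob("*.txt")]
    return sum(counts) / len(counts) if counts else 0.0

# e.g. average_tokens("elitr_bench/dev")  # hypothetical path
```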