XL2Bench is a comprehensive benchmark for evaluating large language models' (LLMs) ability to understand and process extremely long texts with long-range dependencies. It comprises three scenarios (Fiction Reading, Paper Reading, and Law Reading) and four tasks of increasing complexity: Memory Retrieval, Detailed Understanding, Overall Understanding, and Open-ended Generation.
The benchmark includes 27 subtasks in total, with an average text length exceeding 100K words for English and 200K characters for Chinese. To construct the benchmark cost-effectively, the authors combine content extraction, data integration, and data synthesis techniques, leveraging LLMs in the process. They also apply data augmentation strategies to mitigate data contamination.
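The summary does not describe the authors' augmentation pipeline in detail, but a common way to counter contamination from memorized pretraining text is to consistently rename named entities in the source before generating questions from it. The sketch below illustrates that idea only; the function, entity mapping, and excerpt are hypothetical and are not the paper's actual method.

```python
import re

def rename_entities(text: str, mapping: dict[str, str]) -> str:
    """Replace each entity with its substitute, matching whole words only,
    so a model must read the perturbed text rather than recall the original."""
    for original, substitute in mapping.items():
        text = re.sub(rf"\b{re.escape(original)}\b", substitute, text)
    return text

# Hypothetical example: perturb a fiction excerpt before building QA pairs on it.
excerpt = "Elizabeth Bennet first met Mr. Darcy at the Meryton assembly."
mapping = {"Elizabeth Bennet": "Margaret Hale", "Darcy": "Thornton", "Meryton": "Milton"}
print(rename_entities(excerpt, mapping))
```

Because the renaming is applied consistently across the whole text, long-range references remain coherent while answers memorized from pretraining no longer match.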
Experiments on six leading LLMs reveal that their performance significantly lags behind human levels, with a marked decline as text length increases. The results also show that retrieval-based methods fail on the overall and detailed understanding tasks, because those tasks require a comprehensive grasp of the entire long text. The authors' ablation experiments demonstrate that their data augmentation techniques are effective at addressing data contamination concerns.
Overall, XL2Bench provides a valuable resource for advancing research on long-text comprehension, highlighting the current limitations of LLMs and the need for further progress in long-context understanding.