Key Concepts
Ada-LEval, a length-adaptable benchmark, is introduced to rigorously evaluate the long-context capabilities of large language models, revealing significant limitations in their performance, especially in ultra-long-context settings.
Abstract
The paper introduces Ada-LEval, a novel benchmark designed to assess the long-context capabilities of large language models (LLMs). Ada-LEval comprises two challenging tasks:
- TSort: Requires LLMs to arrange shuffled text segments from a long document in the correct order, necessitating comprehensive understanding of the full text (see the construction sketch after this list).
- BestAnswer: Asks LLMs to identify the best answer to a question from a large set of candidates, again demanding thorough comprehension of the provided content.
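To make the TSort task structure concrete, here is a minimal Python sketch of how such a test case could be constructed and scored. The helper names, the equal-length splitting, and the label format are illustrative assumptions rather than the paper's actual generation code; the exact-match scoring mirrors the benchmark's unambiguous evaluation design.

```python
import random

def make_tsort_case(document: str, num_segments: int = 4, seed: int = 0):
    """Split a document into equal-length segments and shuffle them.
    Hypothetical helper; Ada-LEval's own generation pipeline may differ."""
    step = len(document) // num_segments
    segments = [document[i * step:(i + 1) * step] for i in range(num_segments)]
    order = list(range(num_segments))
    random.Random(seed).shuffle(order)
    shuffled = [segments[i] for i in order]  # what the model actually sees
    # Gold label: for each original segment i, its position in the shuffled list,
    # so that [shuffled[g] for g in gold] restores the original document.
    gold = [order.index(i) for i in range(num_segments)]
    return shuffled, gold

def score_tsort(predicted: list[int], gold: list[int]) -> bool:
    """Exact match: a prediction counts only if the entire ordering is right."""
    return predicted == gold
```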
The key advantages of Ada-LEval are:
- Controllable test case length: The length of text segments and number of distractor answers can be adjusted to evaluate LLMs across different context lengths (see the sketch after this list).
- Necessity of full-text comprehension: Successful completion of both tasks requires LLMs to deeply understand the entire text, not just extract superficial information.
- Precise accuracy measurement: The design of the tasks allows for unambiguous evaluation of model performance.
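As a companion to the list above, the following minimal sketch shows how test case length could be controlled in a BestAnswer-style task by varying the number of distractors, together with the exact-match accuracy the task design permits. The function names and prompt formatting are assumptions for illustration, not the paper's implementation.

```python
import random

def make_bestanswer_case(question: str, best_answer: str,
                         distractor_pool: list[str], num_distractors: int,
                         seed: int = 0):
    """Build a BestAnswer-style case; more distractors -> a longer test case.
    Hypothetical helper illustrating the length-control knob."""
    rng = random.Random(seed)
    candidates = [best_answer] + rng.sample(distractor_pool, num_distractors)
    rng.shuffle(candidates)
    gold_index = candidates.index(best_answer)
    prompt = question + "\n\n" + "\n\n".join(
        f"Answer {i + 1}: {a}" for i, a in enumerate(candidates))
    return prompt, gold_index

def accuracy(predictions: list[int], golds: list[int]) -> float:
    """Unambiguous exact-match accuracy over a set of test cases."""
    return sum(p == g for p, g in zip(predictions, golds)) / len(golds)
```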
The paper evaluates several state-of-the-art proprietary and open-source LLMs on Ada-LEval. The results reveal significant limitations in the long-context capabilities of existing models, especially in ultra-long-context settings (32,000+ tokens). Even the most powerful proprietary models struggle to maintain performance as text length increases.
The authors also conduct ablation studies to further analyze the shortcomings of current LLMs, including poor instruction following, strong position bias, and limited scalability of position embeddings. These insights provide valuable guidance for future developments in long-context language modeling.
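One of these findings, position bias, can be probed with a simple controlled sweep: place the gold answer at each candidate position in turn and compare per-position accuracy. The sketch below is illustrative only, not the paper's ablation code, and assumes a caller-supplied `run_model` callable.

```python
def position_bias_sweep(run_model, cases, num_positions: int) -> list[float]:
    """Accuracy as a function of where the gold answer is placed.

    `run_model(case, gold_position) -> bool` is an assumed callable that
    rebuilds the prompt with the best answer at `gold_position` and reports
    whether the model picked it. A flat curve suggests little position bias;
    a curve peaking at the start or end indicates strong bias.
    """
    hits = [0] * num_positions
    for case in cases:
        for pos in range(num_positions):
            hits[pos] += bool(run_model(case, pos))
    return [h / len(cases) for h in hits]
```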
Statistics
The context window of GPT-4-Turbo is 128,000 tokens, while Claude-2 and Claude-2.1 can handle up to 200,000 tokens.
The average token length of test cases in Ada-LEval ranges from 955 to 126,098 tokens, covering both long-context and ultra-long-context settings.
Quotes
"Despite these advancements, three significant limitations persist in existing benchmarks: Firstly, the ultra-long setting (32,000 tokens or longer) is scarcely represented, limiting insights into LLM performance in extreme context lengths. Secondly, the integration of test samples of varying lengths within these benchmarks complicates the evaluation of LLMs across different length ranges. Lastly, the focus on traditional tasks such as question-answering and summarization often does not necessitate comprehensive content understanding by the LLMs, as many questions in these tasks do not require full-text comprehension."