Key Concepts
Evaluating the long-context comprehension abilities of Large Language Models using the NovelQA benchmark.
Abstract
Introduction
Advancements in Large Language Models (LLMs).
The importance of long-context understanding.
Data Extraction and Annotation
Construction of NovelQA from English novels.
Manual annotation process and the distribution of question types.
Experiments and Results
Evaluation of LLMs on NovelQA.
Challenges faced by models in multi-hop reasoning and detail-oriented questions.
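The evaluation reduces to comparing each model's answer against a gold label and reporting accuracy (e.g., GPT-4's 46.88% in the generative setting). A minimal sketch of that scoring step, with hypothetical answer data not taken from the paper:

```python
def accuracy(predictions, gold):
    """Fraction of predicted answers that match the gold answers."""
    assert len(predictions) == len(gold), "one prediction per question"
    correct = sum(p == g for p, g in zip(predictions, gold))
    return correct / len(gold)

# Hypothetical multiple-choice answers: 3 of 4 match the gold labels.
preds = ["A", "C", "B", "D"]
golds = ["A", "C", "B", "A"]
print(accuracy(preds, golds))  # 0.75
```

In a generative setting the comparison is typically looser (normalized string match or a judge model) rather than exact equality, but the aggregate metric is the same.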
Analysis
Performance analysis by question type and evidence-recall results.
Conclusion
Contributions of NovelQA to NLP and computational literary studies.
Statistics
"Constructed from English novels, NovelQA offers a unique blend of complexity, length, and narrative coherence."
"NovelQA reveals significant insights into the models’ performance, particularly emphasizing the challenges they face with multi-hop reasoning, detail-oriented questions, and extremely long input with more than 100,000 tokens."
"The most advanced long-context LLMs are capable of processing over 250,000 tokens."
"GPT-4 achieves a 46.88% accuracy rate in a generative setting."
"Models exhibit particular difficulty with questions centered around meaning, relation, span, and times."
Quotes
"The disparity is further highlighted by the increasing context window size of LLMs."
"NovelQA addresses the need for assessing extremely long-context understanding."
"These results highlight challenges not only in memory optimization but also in nuanced comprehension."