Core Concepts
Large Language Models (LLMs) struggle with long-context comprehension, a challenge that NovelQA is designed to measure.
Abstract:
Introduction of NovelQA as a benchmark for evaluating Large Language Models (LLMs) on extended texts from English novels.
Highlights the challenges LLMs face in understanding long-context information and the need for further advances in long-context modeling.
Data Extraction and Annotation Process:
Constructed from English novels to test LLM capabilities with extended texts.
Manual annotation process by skilled annotators with degrees in English Literature.
Evaluation Results:
Significant insights into LLM performance on NovelQA, emphasizing challenges with multi-hop reasoning and detailed questions.
Commercial models like GPT-4 outperform open-source models in generative and multichoice settings.
Related Work:
Comparison with existing benchmarks such as ZeroSCROLLS, LooGLE, and LongBench, highlighting the importance of understanding long texts.
Experiments:
Evaluation of various long-context LLMs, including GPT-4, Claude 2.1, and InternLM2, on NovelQA.
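The multichoice setting described above can be scored with a simple accuracy computation. Below is a minimal sketch, assuming a hypothetical data format in which each item carries a `gold` option letter and a model's `pred` letter; NovelQA's actual data schema and evaluation code may differ.

```python
# Hypothetical scoring sketch for a multichoice QA evaluation.
# Assumes each item is a dict with "gold" and "pred" option letters;
# this format is illustrative, not NovelQA's official one.

def multichoice_accuracy(items):
    """Return the fraction of items whose predicted option matches the gold answer."""
    if not items:
        return 0.0
    correct = sum(
        1 for item in items
        if item["pred"].strip().upper() == item["gold"].strip().upper()
    )
    return correct / len(items)

# Toy example: 2 of 3 predictions match the gold answers.
items = [
    {"gold": "A", "pred": "a"},
    {"gold": "B", "pred": "C"},
    {"gold": "D", "pred": "D"},
]
print(multichoice_accuracy(items))
```

In the generative setting, by contrast, free-form answers would need a softer match (e.g., string normalization or human/LLM judging) rather than exact option comparison.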
Analysis:
Performance analysis based on question types reveals weaknesses in narrative comprehension and abstract concept interpretation.
Conclusion:
NovelQA contributes to advancing research in NLP and computational literary studies by challenging LLMs with complex real-world texts.
Stats
NovelQA was constructed to evaluate the performance of LLMs.
NovelQA is built from English novels to test LLM capabilities.
Commercial models such as GPT-4 achieved superior scores in both the generative and multichoice settings.