CLongEval introduces a benchmark with 7 tasks and 7,267 examples to assess long-context LLMs. It focuses on information acquisition and reasoning abilities, providing insights into model performance across various tasks and context lengths.
Developing Large Language Models (LLMs) with robust long-context capabilities has been a recent research focus, yet the evaluation of these models remains underdeveloped due to a lack of suitable benchmarks. CLongEval addresses this gap by presenting a comprehensive Chinese benchmark for evaluating long-context LLMs.
The benchmark features sufficient data volume, broad applicability, and high-quality annotations. It evaluates open-source and commercial models proficient in Chinese across seven tasks: Long Story QA, Conversation Memory, Summarization, News Labeling, Typo Detection, Key-Passage Retrieval, and Table Querying.
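As a rough illustration of how a long-context QA task like Long Story QA might be scored, the sketch below computes character-level answer F1 over a set of examples. This is a minimal sketch only, not CLongEval's actual data schema or evaluation harness: the field names ("context", "question", "answer"), the character budget, and the model.generate() interface are assumptions for illustration.

from collections import Counter

def char_f1(prediction: str, reference: str) -> float:
    """Character-level F1, a common choice for matching Chinese answers."""
    pred, ref = Counter(prediction), Counter(reference)
    num_same = sum((pred & ref).values())
    if num_same == 0:
        return 0.0
    precision = num_same / len(prediction)
    recall = num_same / len(reference)
    return 2 * precision * recall / (precision + recall)

def evaluate(model, examples, max_context_chars=100_000):
    """Average F1 over QA examples, truncating contexts that exceed the assumed window."""
    scores = []
    for ex in examples:
        context = ex["context"][:max_context_chars]
        prompt = f"{context}\n\n问题：{ex['question']}\n回答："
        prediction = model.generate(prompt)  # hypothetical model call
        scores.append(char_f1(prediction, ex["answer"]))
    return sum(scores) / len(scores)

Varying max_context_chars in such a loop is one simple way to probe how performance changes with context length, which is the kind of analysis the benchmark is designed to support.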
Results reveal a clear performance gap between open-source and commercial models across tasks: Moonshot-v1 and GPT-4-Turbo handle long contexts notably better than the other evaluated models. The position of the referenced chunk within the context also affects performance, with the effect varying across tasks.
Overall, CLongEval provides valuable insights into the capabilities of long-context LLMs for practical applications in Chinese language processing.