CLongEval is a benchmark of 7 tasks and 7,267 examples for assessing long-context LLMs in Chinese. It addresses the lack of robust evaluation benchmarks for models with extended context capabilities. Its tasks target both information acquisition and reasoning abilities. The benchmark evaluates 8 LLMs and highlights performance discrepancies between open-source and commercial models across the various tasks.
by Zexuan Qiu, J... at arxiv.org, 03-07-2024
https://arxiv.org/pdf/2403.03514.pdf