Core Concepts
Large language models struggle with robust multistep reasoning, motivating a new evaluation dataset, MuSR.
Abstract:
Large language models (LLMs) still struggle to reason robustly in complex settings.
Benchmark datasets for logical deduction have not kept pace with growing system capabilities.
MuSR is a dataset for evaluating LLMs on multistep soft reasoning tasks specified as natural language narratives.
Introduction:
Evaluating LLMs' reasoning abilities remains challenging due to the limitations of existing benchmarks.
MuSR targets reasoning over free-text narratives and remains challenging for state-of-the-art models.
Data Extraction:
"We introduce MuSR, a dataset for evaluating language models on multistep soft reasoning tasks specified in a natural language narrative."
"Our contributions are as follows: (1) We introduce a new reasoning benchmark, MuSR, consisting of 756 total examples across three domains that challenge state-of-the-art models such as GPT-4, Llama 2, and Vicuna."
Experiments:
Human evaluators substantially outperform LLMs across MuSR domains.
Neurosymbolic approaches tailored to specific domains outperform end-to-end models but fall short of human performance.
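The evaluation loop these experiments imply can be sketched as follows, reusing the MuSRExample structure above. The chain-of-thought prompt wording, the `generate` callable (standing in for any LLM API), and the answer-parsing heuristic are all assumptions for illustration, not the paper's exact setup.

```python
from typing import Callable

def build_prompt(ex: MuSRExample) -> str:
    """Format one example as a chain-of-thought multiple-choice prompt."""
    options = "\n".join(f"{i + 1}. {c}" for i, c in enumerate(ex.choices))
    return (
        f"{ex.narrative}\n\n{ex.question}\n{options}\n"
        "Think step by step, then end with the number of your choice."
    )

def evaluate(examples: list[MuSRExample],
             generate: Callable[[str], str]) -> float:
    """Accuracy of a model (wrapped by `generate`) on MuSR-style examples."""
    correct = 0
    for ex in examples:
        reply = generate(build_prompt(ex))
        # Naive parsing: treat the last digit in the reply as the chosen option.
        digits = [ch for ch in reply if ch.isdigit()]
        if digits and int(digits[-1]) - 1 == ex.answer_index:
            correct += 1
    return correct / len(examples)
```

Calling evaluate(dataset, my_llm_call) would then yield a single accuracy number comparable across models and against the human scores discussed above.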