Core Concepts
The performance of Large Language Models (LLMs) on Multi-hop Question Answering (MHQA) tasks is evaluated through a new benchmark, MRKE, highlighting the need for trustworthy evaluation of their reasoning abilities.
Stats
"GPT-4 only gets 36.3% right reasoning chain."
"GPT-4 gets 69.3 EM and 82.2 F1 scores on the original HotpotQA dataset."
Quotes
"We believe this new Multi-hop QA evaluation benchmark and novel evaluation methods will facilitate the development of trustworthy LLM evaluation on the MHQA task."