Bibliographic Information: Fujisawa, I., Nobe, S., Seto, H., Onda, R., Uchida, Y., Ikoma, H., Chien, P., & Kanai, R. (2024). ProcBench: Benchmark for Multi-Step Reasoning and Following Procedure. arXiv preprint arXiv:2410.03117.
Research Objective: This paper introduces ProcBench, a benchmark designed to assess the capability of large language models (LLMs) to accurately follow explicit multi-step instructions, a crucial aspect of reasoning often overlooked in standard evaluations.
Methodology: ProcBench consists of 23 tasks designed to minimize reliance on implicit knowledge and instead isolate procedural reasoning. Each task involves simple manipulations of strings, lists, or integers, with complexity increasing as the number of steps grows. Seven state-of-the-art LLMs were evaluated on ProcBench using three metrics, Prefix Accuracy (PA), Sequential Match (SM), and Final Match (FM), to analyze performance across varying problem lengths and task types.
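To make these metrics concrete, the following is a minimal Python sketch of how PA, SM, and FM might be computed by comparing a model's predicted sequence of intermediate states against the ground-truth sequence. The function names and the normalization by the longer sequence are illustrative assumptions, not the paper's exact formulation.

```python
from typing import Any, List


def prefix_match_length(pred: List[Any], gold: List[Any]) -> int:
    """Number of leading steps on which the predicted and gold state sequences agree."""
    n = 0
    for p, g in zip(pred, gold):
        if p != g:
            break
        n += 1
    return n


def prefix_accuracy(pred: List[Any], gold: List[Any]) -> float:
    """Fraction of steps correctly followed before the first deviation.

    Normalizing by the longer of the two sequences is an assumption here;
    the paper's exact normalization may differ.
    """
    denom = max(len(pred), len(gold))
    return prefix_match_length(pred, gold) / denom if denom else 1.0


def sequential_match(pred: List[Any], gold: List[Any]) -> int:
    """1 if the full predicted sequence of intermediate states matches the gold sequence."""
    return int(pred == gold)


def final_match(pred: List[Any], gold: List[Any]) -> int:
    """1 if only the final state matches, regardless of intermediate steps."""
    return int(bool(pred) and bool(gold) and pred[-1] == gold[-1])


# Hypothetical example: a string-manipulation task traced over three steps,
# where the model deviates at the last step.
gold_states = ["abc", "abcd", "dcba"]
pred_states = ["abc", "abcd", "dcab"]
print(prefix_accuracy(pred_states, gold_states))   # 0.666...
print(sequential_match(pred_states, gold_states))  # 0
print(final_match(pred_states, gold_states))       # 0
```

In this reading, PA degrades gracefully with partial compliance, SM rewards only fully correct step sequences, and FM credits a correct end result even if intermediate steps were wrong.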
Key Findings: The study found that while top-performing models like o1-preview and o1-mini demonstrated high accuracy on tasks with shorter step sequences, their performance significantly declined as the number of steps increased. This suggests that current LLMs, despite their proficiency in knowledge-based reasoning, struggle with complex procedural reasoning tasks that require strict adherence to multi-step instructions.
Main Conclusions: ProcBench highlights a critical limitation in current LLMs: their difficulty in consistently following detailed procedural instructions, especially in tasks involving longer sequences of steps. This underscores the need for further research and development to enhance the instruction-following capabilities of LLMs, which is crucial for improving their performance in complex problem-solving scenarios.
Significance: This research significantly contributes to the field of LLM evaluation by introducing a novel benchmark that specifically targets procedural reasoning, an area often overshadowed in traditional evaluations. ProcBench provides valuable insights into the strengths and weaknesses of current LLMs, paving the way for developing more robust and reliable language models capable of handling complex, multi-step reasoning across diverse domains.
Limitations and Future Research: The study acknowledges the inherent difficulty of completely eliminating implicit knowledge requirements in benchmark design. Future research could expand ProcBench to a wider range of tasks and investigate how explicit instruction-following capabilities can be better integrated into LLM training and into traditional benchmarks.