Wu, X., Wang, M., Liu, Y., Shi, X., Yan, H., Lu, X., ... & Zhang, W. (2024). LIFBench: Evaluating the Instruction Following Performance and Stability of Large Language Models in Long-Context Scenarios. arXiv preprint arXiv:2411.07037.
The paper addresses the lack of benchmarks designed specifically to evaluate how well, and how consistently, LLMs understand and execute instructions in long-context scenarios.
The authors developed LIFBench, a benchmark comprising three long-context scenarios (List, MultiDoc, OneDoc) with eleven diverse tasks simulating real-world LLM applications. They employed an automated instruction expansion method to create variations in length, expression, and variables, resulting in 2,766 instructions. For evaluation, they proposed LIFEval, an automated rubric-based scoring framework that assesses LLM outputs based on predefined criteria and maps scores to six core instruction-following capabilities. Additionally, they introduced Instruction Following Stability (IFS) as a metric to quantify the consistency of LLM performance across different input variations.
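The two ideas in this paragraph, rubric-based scoring and a stability measure over input variations, can be illustrated with a minimal sketch. It is not the authors' implementation: the function names, the rubric weights, and the use of a coefficient of variation (standard deviation divided by mean) as the stability measure are all assumptions made for illustration, and the paper's exact definitions of LIFEval and IFS may differ.

```python
from statistics import mean, pstdev


def rubric_score(checks: dict[str, bool], weights: dict[str, float]) -> float:
    """Return the fraction of rubric weight earned by one model output.

    `checks` maps a criterion name to whether the output satisfied it;
    `weights` maps each criterion to its (hypothetical) point value.
    """
    total = sum(weights.values())
    earned = sum(w for name, w in weights.items() if checks.get(name, False))
    return earned / total


def stability(scores_by_group: dict[str, list[float]]) -> float:
    """Return the mean coefficient of variation across groups of scores.

    Each group holds scores for the same task under different input
    variations (for example, different context lengths). Lower values
    mean more consistent performance. This is an assumed stand-in for
    the paper's IFS metric, not its published formula.
    """
    cvs = []
    for scores in scores_by_group.values():
        mu = mean(scores)
        # A group with a zero mean has no spread to normalize; treat it as 0.
        cvs.append(pstdev(scores) / mu if mu > 0 else 0.0)
    return mean(cvs)


# Hypothetical rubric with two equally weighted criteria.
score = rubric_score({"format": True, "count": False},
                     {"format": 1.0, "count": 1.0})

# Hypothetical scores for one task across two input variations.
spread = stability({"list_task": [1.0, 3.0]})
```

Under this sketch, a model that scores identically on every variation of a task gets a stability value of 0, and the value grows as scores diverge across variations.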
LIFBench and LIFEval provide tools for assessing LLM performance in complex, long-context settings. The study highlights three points: the importance of instruction fine-tuning, the need for more stable instruction following, and the gap between open-source and closed-source models.
The research contributes to LLM evaluation by introducing a dedicated benchmark and evaluation framework for long-context scenarios. Its findings give researchers and developers guidance on improving the instruction-following capability and stability of LLMs, with the goal of more reliable deployment in real-world applications.
The study acknowledges a limitation: inputs are right-truncated for models that cannot handle the longest contexts, which may affect the stability assessment. Future research could explore alternatives to truncation. Expanding LIFBench with more diverse scenarios and tasks would also make it more comprehensive for evaluating LLMs in long-context settings.