Wang, M., Wu, W., Gao, C., Wang, D., Feng, S., & Zhang, Y. (2024). RoCar: A Relationship Network-based Evaluation Method for Large Language Models. arXiv preprint arXiv:2307.15997v2 [cs.CL], 11 Nov 2024.
This paper introduces RoCar, a new method for evaluating the reasoning and memory capabilities of Large Language Models (LLMs) by using randomly generated social network graphs. The objective is to create a fairer evaluation method that minimizes the risk of LLMs having prior knowledge of the evaluation tasks.
RoCar constructs evaluation tasks based on social network graphs. It first defines a set of 27 basic relationship schemas, each representing a first-order relationship type with attributes like gender, order, and direction. These schemas are used to randomly generate task graphs, simulating social networks. To ensure fairness, surrogate libraries containing names and genders are used to populate the nodes in the task graph. The relationships in the graph are then converted into natural language prompts, which are fed to the LLMs. The LLMs are then asked questions based on the task graph to evaluate their reasoning and memory abilities.
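The pipeline described above (schemas, random task graph, surrogate names, natural language prompt, question) can be sketched roughly as follows. This is a minimal illustration, not the authors' implementation: the three schemas, the name library, and every function name are hypothetical stand-ins, since the paper's 27 schemas are not listed in this summary. The sketch also skips consistency checks that a real generator would likely need, such as preventing one person from having two fathers.

```python
import random
from dataclasses import dataclass


@dataclass(frozen=True)
class Schema:
    name: str       # relationship label, e.g. "father"
    gender: str     # gender required of the source node: "M", "F", or "any"
    directed: bool  # False means the relationship holds in both directions


# Hypothetical subset of schemas; the paper defines 27, not reproduced here.
SCHEMAS = [
    Schema("father", "M", True),
    Schema("mother", "F", True),
    Schema("friend", "any", False),
]

# Hypothetical surrogate library mapping names to genders.
NAMES = {"Alice": "F", "Bob": "M", "Carol": "F",
         "David": "M", "Erin": "F", "Frank": "M"}


def generate_task_graph(n_nodes, n_edges, seed):
    """Randomly pick people and connect distinct pairs with compatible schemas."""
    if n_nodes > len(NAMES) or n_edges > n_nodes * (n_nodes - 1) // 2:
        raise ValueError("not enough names or node pairs for the requested graph")
    rng = random.Random(seed)
    people = rng.sample(sorted(NAMES), n_nodes)
    edges, used_pairs = [], set()
    while len(edges) < n_edges:
        src, dst = rng.sample(people, 2)
        pair = frozenset((src, dst))
        if pair in used_pairs:
            continue  # at most one relationship per pair of people
        # Keep only schemas whose gender attribute fits the source person.
        candidates = [s for s in SCHEMAS if s.gender in ("any", NAMES[src])]
        used_pairs.add(pair)
        edges.append((src, rng.choice(candidates), dst))
    return edges


def edge_to_sentence(src, schema, dst):
    """Render one edge of the task graph as a natural language statement."""
    if schema.directed:
        return f"{src} is the {schema.name} of {dst}."
    return f"{src} and {dst} are {schema.name}s."


def build_prompt(edges):
    """Join all edge statements into the context given to the model."""
    return " ".join(edge_to_sentence(*edge) for edge in edges)


def memory_question(edges, rng):
    """Ask about one stated edge; the expected answer is its schema name."""
    src, schema, dst = rng.choice(edges)
    return f"What is the relationship between {src} and {dst}?", schema.name


edges = generate_task_graph(n_nodes=4, n_edges=3, seed=0)
print(build_prompt(edges))
print(memory_question(edges, random.Random(1)))
```

Because the graph and the names are drawn fresh from a seed, a different seed yields a different task each time, which is the property the summary credits for reducing the chance that a model has seen the task before. A reasoning question, as opposed to the memory question shown here, would ask about a relationship implied by a path of several edges rather than one stated directly.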
The paper presents the results of applying RoCar to evaluate several open-use LLMs. The findings suggest that RoCar can effectively differentiate between the reasoning and memory capabilities of different LLMs.
The authors conclude that RoCar offers a promising way to evaluate LLMs that is fairer and more comprehensive than existing methods. The use of randomly generated social network graphs minimizes the risk of bias due to pre-existing knowledge, and the method allows for the assessment of both reasoning and memory capabilities.
This research contributes to the field of Natural Language Processing by proposing a novel and potentially more robust method for evaluating LLMs. As LLMs become increasingly sophisticated, developing reliable and unbiased evaluation methods is crucial for tracking progress and guiding future research.
The authors acknowledge that the current implementation of RoCar could be further expanded by including a wider range of relationship types and incorporating relationships that reflect real-world complexities beyond simple social connections. Future research could also focus on evaluating more LLMs, conducting multi-group randomized experiments to enhance fairness, and exploring the impact of different prompt types and formats on LLM performance.