This study presents a novel benchmark for evaluating the humanlikeness of large language models (LLMs) in language use. The benchmark consists of 10 psycholinguistic experiments covering key linguistic aspects such as sound, word, syntax, semantics, and discourse.
The researchers collected responses from over 2,000 human participants and compared them with outputs from 20 LLMs spanning the OpenAI, Meta (Llama), and Mistral families. An auto-coding algorithm was developed to reliably extract language-use patterns from the responses, and humanlikeness was quantified as the similarity between the human and LLM response distributions.
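The summary does not spell out the similarity measure used, so the sketch below is only one plausible way to score humanlikeness from coded responses: it computes one minus the Jensen-Shannon divergence between the human and LLM response distributions for a single experimental item. The response categories, counts, and the 1 − JSD scoring rule are illustrative assumptions, not the paper's exact procedure.

```python
# Minimal sketch of scoring humanlikeness from response distributions.
# Categories, counts, and the 1 - JSD rule are illustrative assumptions.
from collections import Counter
from math import log2


def distribution(responses, categories):
    """Turn a list of coded responses into a probability distribution
    over a fixed set of response categories."""
    counts = Counter(responses)
    total = sum(counts.values()) or 1
    return [counts.get(c, 0) / total for c in categories]


def js_divergence(p, q):
    """Jensen-Shannon divergence between two distributions
    (base-2 logs, bounded in [0, 1])."""
    def kl(a, b):
        return sum(x * log2(x / y) for x, y in zip(a, b) if x > 0)
    m = [(x + y) / 2 for x, y in zip(p, q)]
    return 0.5 * kl(p, m) + 0.5 * kl(q, m)


# Hypothetical coded responses for one ambiguity-resolution item
categories = ["high_attachment", "low_attachment", "other"]
human_responses = (["low_attachment"] * 70 + ["high_attachment"] * 25
                   + ["other"] * 5)
llm_responses = (["low_attachment"] * 40 + ["high_attachment"] * 55
                 + ["other"] * 5)

p_human = distribution(human_responses, categories)
p_llm = distribution(llm_responses, categories)

# Smaller divergence between the distributions = higher humanlikeness
humanlikeness = 1 - js_divergence(p_human, p_llm)
print(f"Humanlikeness score: {humanlikeness:.3f}")
```

Aggregating such per-item scores across all ten experiments would then yield the benchmark-level humanlikeness comparisons reported for each model family.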
The results reveal significant differences in how closely LLMs approximate human language use across linguistic levels. The Llama family, particularly Meta-Llama-3.1-70B-Instruct, consistently outperformed the OpenAI and Mistral models on humanlikeness, whereas the Mistral models showed a decline in humanlikeness over time.
The study also highlights specific areas where LLMs diverge from human language patterns, such as in semantic priming and ambiguity resolution tasks. These findings underscore the importance of using psycholinguistic methods to evaluate LLMs, as traditional NLP benchmarks often fail to capture the nuances of human language use.
By introducing a comprehensive psycholinguistic benchmark, this study provides a new framework for assessing the humanlikeness of LLMs and offers critical insights for the continued development of language models that more closely mirror the richness and diversity of human communication.