The study investigates whether scaling model size (number of parameters) can close the gap between the language abilities of Large Language Models (LLMs) and those of humans. Three LLMs - Bard (137 billion parameters), ChatGPT-3.5 (175 billion), and ChatGPT-4 (1.5 trillion) - were tested on a grammaticality judgment task covering four linguistic phenomena: anaphora, center embedding, comparatives, and negative polarity.
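To make the task format concrete, below is a minimal sketch of how a repeated yes/no grammaticality-judgment query could be posed to a chat model through the OpenAI Python client. The prompt wording, model name, example sentences, and repetition count are illustrative assumptions, not the authors' actual stimuli or protocol.

```python
# Minimal sketch (not the study's exact protocol): ask a chat model the same
# yes/no grammaticality question several times and tally its answers.
# Prompt wording, model name, sentences, and repetition count are assumptions.
from collections import Counter

from openai import OpenAI

client = OpenAI()  # assumes OPENAI_API_KEY is set in the environment

# Toy negative-polarity examples; not taken from the study's materials.
SENTENCES = {
    "npi_grammatical": "Nobody has ever seen such a thing.",
    "npi_ungrammatical": "Somebody has ever seen such a thing.",
}


def judge(sentence: str, repetitions: int = 5, model: str = "gpt-4") -> Counter:
    """Pose the same grammaticality question repeatedly and count the answers."""
    answers = Counter()
    for _ in range(repetitions):
        response = client.chat.completions.create(
            model=model,
            messages=[{
                "role": "user",
                "content": (
                    "Is the following sentence grammatically correct in English? "
                    f"Answer only 'yes' or 'no'.\n\n{sentence}"
                ),
            }],
            temperature=1.0,  # default sampling, so variation across repeats is visible
        )
        answers[response.choices[0].message.content.strip().lower()] += 1
    return answers


if __name__ == "__main__":
    for label, sentence in SENTENCES.items():
        print(label, dict(judge(sentence)))
```

Tallying answers across repetitions of the same sentence is one simple way to quantify the accuracy, response stability, and repetition effects discussed in the results below.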
The results show that while the largest model, ChatGPT-4, outperforms the smaller models in accuracy, it still differs markedly from human performance:
Accuracy: ChatGPT-4 is more accurate than humans for grammatical sentences, but less accurate for ungrammatical sentences. Its accuracy also decreases for ungrammatical sentences upon repeated exposure.
Stability: ChatGPT-4's responses are less stable than those of humans, with more fluctuations and deviations from the preferred response, especially for ungrammatical sentences.
Repetition effects: While some models show improvements in accuracy with repeated exposure, this is not a consistent pattern across conditions, suggesting that scaling alone does not guarantee convergence towards human-like performance.
The authors argue that these gaps between LLMs and humans stem from fundamental differences in how language is learned in silico versus in vivo. Specifically, LLMs have access to negative evidence (information about ungrammatical constructions) during training, unlike humans, and rely on massive amounts of data rather than the limited input available to children. Additionally, LLMs lack the ability to form hypotheses about language and to develop an understanding of linguistic reference, leading to semantic hallucinations. These factors contribute to the persistent gap between LLM and human language abilities, even with significant scaling.