The paper investigates how autoregressive language models of varying size and capability (GPT-2, GPT-3/3.5, Llama 2, and GPT-4) handle scope-ambiguous sentences, and compares their behavior to human judgments.
The key highlights and insights are:
Experiment 1 shows that more advanced language models such as GPT-3.5, Llama 2 70B, and GPT-4 can exhibit scope-reading preferences similar to those of humans, with a high level of accuracy. Smaller or less advanced models, however, struggle.
Experiment 2 suggests that a wide range of language models are sensitive to the meaning ambiguity of scope-ambiguous sentences, as evidenced by positive mean α-scores and significant correlations between model α-scores and human proxy scores (a sketch of the kind of log-probability comparison involved appears after this list).
The results indicate that language models can capture the distinct semantic structures corresponding to surface and inverse scope readings, and can also integrate background world knowledge when disambiguating scope-ambiguous constructions.
Llama 2 chat models generally perform better than their base counterparts, suggesting that fine-tuning on human feedback may improve a model's ability to handle scope ambiguities.
Follow-up experiments on expanded datasets confirm that these findings generalize.
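As a concrete illustration of the kind of comparison that such scores rest on, the minimal sketch below asks a causal language model which of two disambiguating continuations it assigns higher probability after a scope-ambiguous sentence. This is a hedged sketch, not the paper's exact α-score computation: the model choice (GPT-2 via Hugging Face transformers), the example sentences, and the simple log-probability difference are all illustrative assumptions.

```python
# Sketch: compare a causal LM's log-probabilities for two continuations that
# disambiguate a scope-ambiguous sentence. Illustrative only; not the paper's
# exact alpha-score definition or materials.
import torch
from transformers import AutoModelForCausalLM, AutoTokenizer

tokenizer = AutoTokenizer.from_pretrained("gpt2")
model = AutoModelForCausalLM.from_pretrained("gpt2")
model.eval()

def continuation_logprob(context: str, continuation: str) -> float:
    """Sum of token log-probabilities of `continuation` given `context`.

    Assumes the tokenization of context + continuation keeps the context
    tokens as an unchanged prefix (true here with a leading space on the
    continuation).
    """
    ctx_ids = tokenizer(context, return_tensors="pt").input_ids
    full_ids = tokenizer(context + continuation, return_tensors="pt").input_ids
    with torch.no_grad():
        logits = model(full_ids).logits
    log_probs = torch.log_softmax(logits, dim=-1)
    total = 0.0
    # Score only the continuation tokens; logits at position i-1 predict token i.
    for i in range(ctx_ids.shape[1], full_ids.shape[1]):
        total += log_probs[0, i - 1, full_ids[0, i]].item()
    return total

ambiguous = "Every child climbed a tree."
surface = " Each child may have climbed a different tree."     # every > a
inverse = " There was one tree that all the children climbed." # a > every

diff = continuation_logprob(ambiguous, surface) - continuation_logprob(ambiguous, inverse)
print(f"log P(surface) - log P(inverse) = {diff:.2f}")  # > 0 favors the surface reading
```

Aggregating such differences across many sentence pairs, and correlating them with human judgments, is the general shape of the sensitivity analysis the summary describes.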