The authors evaluate the language understanding capabilities of large language models (LLMs) on simple inference tasks that most humans find trivial. Specifically, they target (i) grammatically-specified entailments, (ii) premises with evaluative adverbs, and (iii) monotonicity entailments.
The authors design evaluation sets for these tasks and conduct experiments in both zero-shot and chain-of-thought setups, with multiple prompts and LLMs. The results show that the models exhibit moderate to low performance on these evaluation sets.
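To make the evaluation setup concrete, the sketch below shows how a single zero-shot or chain-of-thought entailment query might be formatted and its answer parsed. The prompt wording, label set, example pair, and helper functions are illustrative assumptions for exposition, not the authors' actual prompts or code.

```python
# Illustrative sketch (assumed prompt format, not the paper's actual materials):
# formatting a zero-shot vs. chain-of-thought entailment query and parsing the label.

def build_prompt(premise: str, hypothesis: str, chain_of_thought: bool = False) -> str:
    """Format a single natural-language-inference query for an LLM."""
    instruction = (
        "Given the premise, decide whether the hypothesis is entailed. "
        "Answer with 'entailment', 'contradiction', or 'neutral'."
    )
    body = f"Premise: {premise}\nHypothesis: {hypothesis}\n"
    if chain_of_thought:
        # Chain-of-thought setup: ask the model to reason before giving the label.
        return f"{instruction}\n{body}Let's think step by step, then give the label."
    # Zero-shot setup: ask for the label directly.
    return f"{instruction}\n{body}Label:"


def parse_label(model_output: str) -> str:
    """Map a free-form model response onto one of the three entailment labels."""
    text = model_output.lower()
    for label in ("entailment", "contradiction", "neutral"):
        if label in text:
            return label
    return "unparsed"


if __name__ == "__main__":
    # Example pair whose inference rests on the presupposition trigger "stopped".
    prompt = build_prompt(
        premise="Mary stopped smoking.",
        hypothesis="Mary used to smoke.",
        chain_of_thought=False,
    )
    print(prompt)
    print(parse_label("The answer is: entailment."))  # -> "entailment"
```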
Subsequent experiments reveal that embedding the premise in syntactic constructions that should preserve the entailment relations (presupposition triggers) or change them (non-factives) further confuses the models, causing them to either under-predict or over-predict certain entailment labels regardless of the true relation, often disregarding the nature of the embedding context.
Overall, the results suggest that despite LLMs' celebrated language understanding capacity, even the strongest models have blind spots with respect to certain types of entailments, and certain information-packaging structures act as "blinds" overshadowing the semantics of the embedded premise.