GPT-4 Outperforms Other LLMs on SmartPlay Benchmark, but Significant Gaps Remain Compared to Human Baselines
SmartPlay is a benchmark that evaluates large language models (LLMs) as intelligent agents across six games, probing key abilities such as reasoning, planning, spatial reasoning, and learning from interaction history. The results show that while GPT-4 variants outperform the other LLMs tested, significant gaps remain relative to human baseline performance, especially on the more challenging games: Tower of Hanoi, Crafter, and Minecraft.
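The paper describes SmartPlay's games as sitting behind a unified, Gym-style interface, so an evaluation run reduces to a standard observe-act loop: the agent receives a text observation, the LLM picks an action, and the environment returns a reward. The sketch below illustrates that loop under those assumptions; the environment ID, the `query_llm` helper, and the direct use of `gymnasium` are illustrative placeholders, not SmartPlay's actual API.

```python
import gymnasium as gym  # assumption: SmartPlay games register as Gym-style environments


def query_llm(prompt: str) -> int:
    """Hypothetical helper: send the text observation to an LLM
    and parse its reply into a discrete action index."""
    raise NotImplementedError("wire this to your LLM of choice")


def evaluate(env_id: str, episodes: int = 10) -> float:
    """Run the observe-act loop for `episodes` rollouts and return the mean episode return."""
    env = gym.make(env_id)  # an ID like "smartplay/Hanoi3Disk-v0" is illustrative only
    returns = []
    for _ in range(episodes):
        obs, _info = env.reset()
        done, total = False, 0.0
        while not done:
            action = query_llm(str(obs))  # text observation in, action index out
            obs, reward, terminated, truncated, _info = env.step(action)
            total += float(reward)
            done = terminated or truncated
        returns.append(total)
    env.close()
    return sum(returns) / len(returns)
```

Per-game scores from a loop like this can then be set against the human baselines the benchmark reports, which is how the gaps described above are measured.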