MANGO: A Benchmark for Evaluating Mapping and Navigation Abilities of Large Language Models
Large language models such as ChatGPT and GPT-4 perform poorly on MANGO, a benchmark that tests their ability to construct maps and navigate complex text-based environments, indicating that their spatial reasoning and mapping capabilities need improvement.