Evaluating First-Person Perspective Thinking Capability of Vision-Language Models
This work introduces EgoThink, a novel benchmark for comprehensively evaluating the first-person perspective thinking capability of vision-language models (VLMs), and presents extensive experiments assessing the performance of popular VLMs on this benchmark.