Core Concepts
Vision Language Models (VLMs) exhibit high performance on intentionality understanding tasks but struggle with perspective-taking, challenging the common belief that perspective-taking is necessary for intentionality understanding.
Summary
The study investigates the abilities of Vision Language Models (VLMs) in intentionality understanding and perspective-taking using the IntentBench and PerspectBench datasets from the CogDevelop2K benchmark.
Key findings:
- VLMs generally perform well on IntentBench, indicating that they have developed some ability to understand the intentions behind actions in the visual domain.
- However, VLMs' performance on PerspectBench reveals that they are generally not capable of level-2 perspective-taking, consistently failing to correctly infer what can be seen from a doll's perspective.
- This challenges the common understanding in cognitive science that intentionality understanding is grounded in perspective-taking abilities, particularly in the visual modality.
The authors propose two potential interpretations for this surprising finding:
- VLMs may be able to infer intentions without attempting to take the perspectives of the actors, relying instead on contextual cues and associative learning.
- The Three Mountain Task used in PerspectBench may require cognitive abilities beyond just level-2 perspective-taking, such as the simultaneous confrontation of perspectives in visual reasoning.
The study highlights the need for more contrasting experiments to dissociate the contributions of different pathways of action understanding and perspective-taking in VLMs.
Statistics
"Intentionality is the capacity of the mind to be directed toward, represent, or stand for objects, properties, or states of affairs for further executable actions."
"Theory-of-mind is commonly understood to be grounded in perspective-taking, the ability to cognitively undertake the perspective of another."
"4-year-olds consistently fail on Three Mountain Task. This changes markedly as children enter the concrete operational stage. Children around age 6 could recognize perspectives different from their own. By ages 7–8, they could consistently and successfully identify the perspective of the other person."
Quotes
"To truly understand intentional meaning, theory-of-mind—the ability to simulate the mental content of others is required."
"It is thus argued that perspective-taking grounds intentionality understanding in theory-of-mind."