Core Concepts
This paper introduces a novel method for affordance learning that leverages interactive affinity, utilizing visual and geometric cues to improve the perception and generalization of affordance regions in images.
Stats
In the Seen setting, the proposed method improves on the best segmentation model by 43.87%, the best human pose estimation method by 23.52%, the few-shot segmentation approach by 60.62%, the multimodal model by 32.61%, and the existing affordance learning model (PIANet) by 5.56%, based on the KLD metric.
In the Obj Unseen setting, the proposed model outperforms the advanced segmentation model by 14.84%, the best human pose estimation method by 26.82%, the few-shot segmentation network by 36.50%, the multimodal model by 15.66%, and PIANet by 11.56%, based on the KLD metric.
In the Aff Unseen setting, the proposed model outperforms the advanced segmentation model by 26.40%, the best human pose estimation method by 38.83%, the few-shot segmentation network by 16.07%, the multimodal model by 7.23%, and PIANet by 6.76%, based on the KLD metric.
The CAL dataset consists of 55,047 images from 35 affordance and 61 object categories.
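The percentages above are relative gains on KLD. This summary does not give the paper's exact formulas, so the following is only a minimal sketch under two assumptions: that KLD denotes the Kullback-Leibler divergence between a predicted affordance heatmap and the ground-truth heatmap, each normalized to sum to 1 (lower is better), and that a relative gain is computed as (baseline - ours) / baseline. The function names `normalize`, `kld`, and `relative_improvement` are hypothetical and chosen for illustration.

```python
import math


def normalize(heatmap, eps=1e-12):
    """Flatten a 2D heatmap (list of rows) into a distribution that sums to ~1."""
    flat = [float(v) for row in heatmap for v in row]
    total = sum(flat) + eps
    return [v / total for v in flat]


def kld(pred, gt, eps=1e-12):
    """KL divergence of the ground truth from the prediction (assumed definition).

    Both maps are normalized first; eps guards against log(0) and division by zero.
    """
    p = normalize(pred, eps)
    q = normalize(gt, eps)
    return sum(qi * math.log(eps + qi / (pi + eps)) for pi, qi in zip(p, q))


def relative_improvement(baseline, ours):
    """Percentage gain for a lower-is-better metric (assumed formula)."""
    if baseline <= 0:
        raise ValueError("baseline score must be positive")
    return 100.0 * (baseline - ours) / baseline


# Toy 2x2 heatmaps, not data from the paper.
gt = [[0, 1], [1, 2]]
uniform = [[1, 1], [1, 1]]
print(kld(gt, gt))        # close to zero for a perfect prediction
print(kld(uniform, gt))   # positive for a mismatched prediction
print(relative_improvement(2.0, 1.0))
```

Under these assumptions, a prediction that matches the ground truth scores near zero, and halving a baseline's KLD corresponds to a 50% relative gain.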
Quotes
"Interactive affinity, which represents the contacts between different parts of the human body and local regions of the target object, can provide inherent cues of interconnectivity between humans and objects, thereby reducing the ambiguity of the perceived action possibilities."
"The combinatorial relationship ambiguity means that due to the diversity of human-object interactions, the combination of interactions between the body and the object’s local regions is complex and various, resulting in the model’s difficulty in perceiving the contact regions corresponding to distinct interactions and accurately mining the interactive affinity representations."
"The intra-class correspondence ambiguity refers to the fact that since the same affordance covers multiple classes of objects, there are significant variations in the appearance, views, and scales, which makes the representations of the interactable local regions corresponding to the objects and the relative spatial relationships in the interactive and non-interactive images more inconsistent, resulting in the possible occurrence of negative transfer."