CLIP, a large pre-trained vision-language model, implicitly embeds valuable knowledge about how humans interact with objects, enabling zero-shot affordance grounding without the need for explicit supervision.
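The sketch below is a rough illustration of this idea rather than an established recipe: patch tokens from CLIP's vision transformer are scored against an affordance prompt and reshaped into a coarse heatmap. The checkpoint name, prompt wording, image path, and the projection of patch tokens into CLIP's joint embedding space are all illustrative assumptions.

```python
# Minimal sketch (not a specific paper's method) of probing CLIP for
# zero-shot affordance cues via patch-text similarity.
import torch
from PIL import Image
from transformers import CLIPModel, CLIPProcessor

model = CLIPModel.from_pretrained("openai/clip-vit-base-patch16")
processor = CLIPProcessor.from_pretrained("openai/clip-vit-base-patch16")

image = Image.open("mug.jpg")  # hypothetical input image
prompt = "a photo of the part of an object a person holds"  # assumed prompt

inputs = processor(text=[prompt], images=image, return_tensors="pt", padding=True)

with torch.no_grad():
    # Affordance phrase embedding in CLIP's joint space.
    text_emb = model.get_text_features(
        input_ids=inputs["input_ids"], attention_mask=inputs["attention_mask"]
    )
    # Patch tokens from the vision transformer (CLS token dropped), projected
    # into the joint space -- an approximation, since CLIP is trained only on
    # the pooled image embedding.
    vision_out = model.vision_model(pixel_values=inputs["pixel_values"])
    patches = model.visual_projection(vision_out.last_hidden_state[:, 1:, :])

    patches = patches / patches.norm(dim=-1, keepdim=True)
    text_emb = text_emb / text_emb.norm(dim=-1, keepdim=True)

    sim = (patches @ text_emb.T).squeeze(-1)   # (1, num_patches) similarities
    side = int(sim.shape[-1] ** 0.5)           # 14x14 for ViT-B/16 at 224 px
    heatmap = sim.reshape(side, side)          # coarse zero-shot affordance map
```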
Affordance grounding benefits from the world knowledge embedded in large-scale vision-language models, which improves generalization to novel objects and actions.
INTRA, a novel weakly supervised affordance grounding framework, leverages interaction relationship-guided contrastive learning and text-conditioned affordance map generation to enable flexible and accurate grounding of multiple affordances on a single object, without requiring paired egocentric and exocentric images.
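As a conceptual sketch only, not INTRA's released implementation, the snippet below illustrates the two ingredients named above: a head that conditions frozen image patch features on an affordance text embedding via cross-attention to produce a per-pixel map, and an InfoNCE-style contrastive loss that pulls together embeddings of related interactions. All module names, dimensions, and the exact loss form are assumptions for illustration.

```python
# Conceptual sketch of text-conditioned affordance map generation plus an
# interaction-aware contrastive loss; not the actual INTRA architecture.
import torch
import torch.nn as nn
import torch.nn.functional as F

class TextConditionedAffordanceHead(nn.Module):
    def __init__(self, img_dim=768, txt_dim=512, hidden=256):
        super().__init__()
        self.img_proj = nn.Linear(img_dim, hidden)
        self.txt_proj = nn.Linear(txt_dim, hidden)
        # Image patches attend to the affordance text embedding.
        self.cross_attn = nn.MultiheadAttention(hidden, num_heads=4, batch_first=True)
        self.to_logit = nn.Linear(hidden, 1)

    def forward(self, patch_feats, text_emb):
        # patch_feats: (B, N, img_dim) frozen backbone patch features
        # text_emb:    (B, txt_dim)    embedding of an affordance phrase, e.g. "hold"
        q = self.img_proj(patch_feats)              # (B, N, hidden)
        kv = self.txt_proj(text_emb).unsqueeze(1)   # (B, 1, hidden)
        fused, _ = self.cross_attn(q, kv, kv)       # text-conditioned patch features
        logits = self.to_logit(fused).squeeze(-1)   # (B, N) per-patch affordance logits
        side = int(logits.shape[1] ** 0.5)
        return logits.reshape(-1, side, side)       # (B, H, W) affordance map

def interaction_contrastive_loss(anchor, positive, negatives, tau=0.07):
    """InfoNCE over pooled embeddings: anchor and positive come from related
    interactions (e.g. the same affordance on different objects)."""
    anchor = F.normalize(anchor, dim=-1)            # (B, D)
    positive = F.normalize(positive, dim=-1)        # (B, D)
    negatives = F.normalize(negatives, dim=-1)      # (B, K, D)
    pos = (anchor * positive).sum(-1, keepdim=True) / tau       # (B, 1)
    neg = torch.einsum("bd,bkd->bk", anchor, negatives) / tau   # (B, K)
    logits = torch.cat([pos, neg], dim=1)
    targets = torch.zeros(anchor.size(0), dtype=torch.long)     # positive at index 0
    return F.cross_entropy(logits, targets)
```

Because the map is generated from a text embedding rather than a fixed label set, the same head can, in principle, produce different affordance maps for the same object simply by swapping the conditioning phrase.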