The paper presents a novel approach, AffordanceLLM, that leverages the world knowledge embedded in large-scale vision-language models (VLMs) to enhance the performance of affordance grounding, a fundamental task in computer vision.
The key insights are:
AffordanceLLM is built on a VLM backbone (LLaVA) extended with a mask decoder that predicts affordance maps. Given an image and a text prompt, the model draws on the VLM's world knowledge to produce a dense affordance map over the image.
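The pipeline described above can be sketched as follows. This is a minimal, hypothetical illustration, not the authors' implementation: the function names (`encode_image`, `encode_prompt`, `mask_decoder`), embedding sizes, and the dot-product-plus-upsample decoder are all assumptions standing in for the real LLaVA encoder, LLM, and learned mask decoder.

```python
import numpy as np

# Hypothetical sketch of an AffordanceLLM-style forward pass (illustrative only).
# Real components (LLaVA vision encoder, LLM, learned mask decoder) are replaced
# by random stand-ins; only the data flow mirrors the paper's description.

rng = np.random.default_rng(0)

def encode_image(image):
    """Stand-in for the VLM's vision encoder: image -> patch embeddings."""
    h, w, _ = image.shape
    n_patches = (h // 16) * (w // 16)   # assume 16x16 patches
    return rng.standard_normal((n_patches, 256))

def encode_prompt(prompt):
    """Stand-in for the LLM: text prompt -> a single query embedding
    (analogous to a special token fed to the mask decoder)."""
    return rng.standard_normal(256)

def mask_decoder(patch_embeds, query, out_hw):
    """Toy decoder: score each patch against the query, squash to [0, 1],
    and upsample (nearest-neighbor) into a dense affordance map."""
    scores = patch_embeds @ query                 # (n_patches,)
    probs = 1.0 / (1.0 + np.exp(-scores / 16.0))  # sigmoid
    side = int(np.sqrt(len(probs)))
    grid = probs.reshape(side, side)
    reps = out_hw // side
    return np.kron(grid, np.ones((reps, reps)))   # out_hw x out_hw map

image = rng.standard_normal((224, 224, 3))
affordance_map = mask_decoder(encode_image(image),
                              encode_prompt("Where to grasp to hold the cup?"),
                              224)
print(affordance_map.shape)  # (224, 224)
```

In the actual model the mask decoder is trained end-to-end, so the map highlights image regions that afford the queried action; the sketch only shows the image-plus-prompt input and dense-map output described in the paper.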
The model is evaluated on the challenging AGD20K benchmark, with a focus on generalization to novel objects unseen during training. AffordanceLLM significantly outperforms state-of-the-art baselines, demonstrating strong affordance grounding for in-the-wild objects.
The paper also validates the model's ability to generalize to completely novel objects and actions from random Internet images, showcasing its remarkable flexibility and broad applicability.
Key insights extracted from the paper by Shengyi Qian et al., arxiv.org, 04-19-2024: https://arxiv.org/pdf/2401.06341.pdf