The paper presents AffordanceLLM, a novel approach that leverages the world knowledge embedded in large-scale vision-language models (VLMs) to improve affordance grounding, the task of localizing the regions of an object where a given action can be performed.
The key insights are:
The AffordanceLLM model is built on a VLM backbone (LLaVA) extended with a mask decoder that predicts affordance maps. Given an image and a text prompt, the model draws on the world knowledge in the VLM to produce a dense affordance map, as sketched below.
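To make the architecture concrete, the sketch below shows one way the decoding step could work in PyTorch, assuming the VLM exposes a grid of patch-level visual features and the hidden state of a special mask token carrying the prompt-conditioned signal. The class name, layer sizes, and the multiplicative fusion are illustrative assumptions, not the paper's actual implementation.

```python
import torch
import torch.nn as nn

class AffordanceDecoderSketch(nn.Module):
    """Minimal sketch of the AffordanceLLM idea: fuse VLM features
    conditioned on a text prompt and decode a dense affordance map.
    All module names and sizes are illustrative placeholders."""

    def __init__(self, vlm_dim=4096, hidden_dim=256, map_size=224):
        super().__init__()
        self.map_size = map_size
        # Project the VLM hidden state of a special mask token
        # down to a decoder-friendly dimension (hypothetical design).
        self.token_proj = nn.Linear(vlm_dim, hidden_dim)
        # Project image patch features from the vision encoder.
        self.patch_proj = nn.Linear(vlm_dim, hidden_dim)
        # Lightweight convolutional decoder that upsamples the fused
        # feature grid into a single-channel affordance heatmap.
        self.decoder = nn.Sequential(
            nn.Conv2d(hidden_dim, hidden_dim // 2, 3, padding=1),
            nn.ReLU(),
            nn.Upsample(scale_factor=4, mode="bilinear", align_corners=False),
            nn.Conv2d(hidden_dim // 2, 1, 1),
        )

    def forward(self, patch_feats, mask_token):
        # patch_feats: (B, H*W, vlm_dim) grid of visual features
        # mask_token:  (B, vlm_dim) VLM hidden state carrying the
        #              prompt-conditioned "where to act" signal
        B, N, _ = patch_feats.shape
        H = W = int(N ** 0.5)
        patches = self.patch_proj(patch_feats)            # (B, N, hidden)
        query = self.token_proj(mask_token).unsqueeze(1)  # (B, 1, hidden)
        # Condition each patch on the text-derived query token.
        fused = patches * query                           # (B, N, hidden)
        fused = fused.transpose(1, 2).reshape(B, -1, H, W)
        logits = self.decoder(fused)                      # (B, 1, 4H, 4W)
        heatmap = torch.sigmoid(
            nn.functional.interpolate(
                logits, size=(self.map_size, self.map_size),
                mode="bilinear", align_corners=False)
        )
        return heatmap  # per-pixel affordance probabilities

# Illustrative usage with random tensors standing in for VLM outputs.
decoder = AffordanceDecoderSketch()
patch_feats = torch.randn(1, 256, 4096)   # e.g. a 16x16 patch grid
mask_token = torch.randn(1, 4096)
heatmap = decoder(patch_feats, mask_token)
print(heatmap.shape)                      # torch.Size([1, 1, 224, 224])
```

In training, such a heatmap would be supervised against ground-truth affordance annotations, e.g. with a pixel-wise loss.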
The model is evaluated on the challenging AGD20K benchmark, with a focus on generalization to novel objects unseen during training. AffordanceLLM significantly outperforms state-of-the-art baselines when grounding affordances for in-the-wild objects. A sketch of the standard evaluation metrics follows.
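As context for how such benchmarks are scored, AGD20K evaluations in the literature compare predicted and ground-truth heatmaps with KLD, SIM, and NSS. The NumPy sketch below is a minimal restatement of these standard saliency-style metrics, not the benchmark's official evaluation code.

```python
import numpy as np

def kld(pred, gt, eps=1e-12):
    # Kullback-Leibler divergence between normalized heatmaps (lower is better).
    p = pred / (pred.sum() + eps)
    q = gt / (gt.sum() + eps)
    return float((q * np.log(eps + q / (p + eps))).sum())

def sim(pred, gt, eps=1e-12):
    # Histogram intersection between normalized heatmaps (higher is better).
    p = pred / (pred.sum() + eps)
    q = gt / (gt.sum() + eps)
    return float(np.minimum(p, q).sum())

def nss(pred, gt_positives, eps=1e-12):
    # Normalized Scanpath Saliency: mean of the standardized prediction
    # at ground-truth positive pixels (higher is better).
    p = (pred - pred.mean()) / (pred.std() + eps)
    return float(p[gt_positives > 0].mean())
```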
The paper also validates the model's ability to generalize to completely novel objects and actions from random Internet images, showcasing its remarkable flexibility and broad applicability.
Key insights drawn from the original content by Shengyi Qian et al., arxiv.org, 04-19-2024: https://arxiv.org/pdf/2401.06341.pdf