Key Concepts
Affordance grounding can be significantly improved by leveraging the rich world knowledge embedded in large-scale vision-language models, enabling better generalization to novel objects and actions.
Summary
The paper presents a novel approach, AffordanceLLM, that leverages the world knowledge embedded in large-scale vision-language models (VLMs) to improve affordance grounding, i.e., localizing the regions of an object a person should interact with to perform a given action.
The key insights are:
- VLMs trained on large-scale text data possess rich world knowledge that can be beneficial for affordance reasoning, which goes beyond the limited supervision from training images.
- Incorporating 3D geometric information, such as depth maps, can further improve the model's understanding of object functionality and affordance.
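One simple way to realize the second insight is to feed an estimated depth map to the model alongside the RGB image. The sketch below is a minimal illustration of that idea, not the paper's exact pipeline; the array shapes are placeholders, and in practice the depth map would come from an off-the-shelf monocular depth estimator.

```python
import numpy as np

# Hypothetical sizes; in a real pipeline the depth map would be
# produced by a monocular depth estimator from the RGB image.
H, W = 224, 224
rgb = np.zeros((3, H, W), dtype=np.float32)    # normalized RGB image
depth = np.zeros((1, H, W), dtype=np.float32)  # estimated depth map

# Treat depth as a fourth input channel alongside RGB, so the
# visual encoder sees 3D geometry together with appearance.
rgbd = np.concatenate([rgb, depth], axis=0)
print(rgbd.shape)  # (4, 224, 224)
```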
The AffordanceLLM model is built upon a VLM backbone (LLaVA) and extended with a mask decoder to predict affordance maps. It takes an image and a text prompt as input, and leverages the world knowledge in the VLM to generate the affordance map.
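The data flow above can be sketched in a toy form: the VLM backbone consumes the image and prompt and emits an embedding, which a mask decoder turns into a per-pixel affordance score. Everything below is a hedged stand-in (random features, made-up dimensions, a dot-product decoder); it only illustrates the shapes and the decoder idea, not the actual AffordanceLLM weights or layers.

```python
import numpy as np

rng = np.random.default_rng(0)

# Hypothetical dimensions; the real model uses LLaVA's hidden size
# and a learned mask decoder rather than these toy stand-ins.
HIDDEN, H, W = 64, 14, 14  # embedding dim, feature-map height/width

def vlm_backbone(image_feats, prompt_tokens):
    """Stand-in for the LLaVA backbone: returns one embedding
    conditioned on image + prompt (here simply random)."""
    return rng.standard_normal(HIDDEN)

def mask_decoder(token_embed, image_feats):
    """Toy mask decoder: dot-product the token embedding with
    per-pixel image features, then squash through a sigmoid to get
    an affordance score in [0, 1] for every pixel."""
    logits = np.einsum("d,hwd->hw", token_embed, image_feats)
    return 1.0 / (1.0 + np.exp(-logits))

# Fake inputs: per-pixel visual features and a tokenized prompt.
image_feats = rng.standard_normal((H, W, HIDDEN))
prompt = ["How", "do", "you", "ride", "this", "motorcycle", "?"]

token_embed = vlm_backbone(image_feats, prompt)
affordance_map = mask_decoder(token_embed, image_feats)
print(affordance_map.shape)  # (14, 14): one affordance score per pixel
```

In the real model the decoded map would be upsampled to image resolution and supervised against ground-truth affordance annotations; the sketch stops at the low-resolution score map.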
The model is evaluated on the challenging AGD20K benchmark, with a focus on testing the generalization capability to novel objects unseen during training. AffordanceLLM significantly outperforms state-of-the-art baselines, demonstrating its superior performance in grounding affordance for in-the-wild objects.
The paper also validates the model's ability to generalize to completely novel objects and actions from random Internet images, showcasing its remarkable flexibility and broad applicability.
Example Outputs
"To ride the motorcycle, you should interact with the handlebars, which are located at the front of the motorcycle. The handlebars are used to steer the motorcycle and control its direction and speed."
"Additionally, you should also ensure that the motorcycle is parked in a safe and legal location, and that you have the necessary safety gear, such as a helmet and protective clothing, before attempting to ride it."
Quotes
"With large-scale text pretraining, modern VLMs such as GPT-4, LLaVA and Blip-2 have a rich reservoir of world knowledge, as demonstrated by their extraordinary capabilities in answering visually grounded common sense questions."
"Beside world knowledge, another novel factor we introduce to improve affordance reasoning is 3D geometry, as it holds rich information of object functionality."