VLMs can be leveraged for robotic manipulation through a point-based affordance representation, as demonstrated by MOKA.
MOKA uses a Vision-Language Model (VLM) to solve manipulation tasks specified by free-form language descriptions: the VLM predicts point-based affordances on the observed scene, and those points are then translated into robot motions. By bridging affordance prediction with motion generation in this way, MOKA enables effective control of robots across diverse environments.
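To make this pipeline concrete, below is a minimal Python sketch in the spirit of a point-based affordance approach: a VLM is asked to return a few named points (a grasp point, a target point, and a waypoint) for a given instruction, and those points are converted into a coarse motion plan. The prompt format, point names, and helper functions (`query_vlm`, `affordance_to_motion`) are illustrative assumptions, not MOKA's actual interface.

```python
from dataclasses import dataclass
import json


@dataclass
class PointAffordance:
    """Point-based affordance predicted by the VLM (pixel coordinates)."""
    grasp_point: tuple[float, float]    # where to grasp the object
    target_point: tuple[float, float]   # where to place or move it
    waypoint: tuple[float, float]       # intermediate point for the motion


def query_vlm(image_path: str, instruction: str) -> str:
    """Stub for a VLM call (hypothetical interface).

    A real system would send the image together with a prompt asking the
    model to answer with affordance points as JSON.
    """
    prompt = (
        f"Task: {instruction}\n"
        "Return JSON with pixel coordinates for 'grasp_point', "
        "'target_point', and 'waypoint'."
    )
    # Placeholder response; replace with an actual VLM API call.
    return json.dumps({
        "grasp_point": [212.0, 340.0],
        "target_point": [480.0, 305.0],
        "waypoint": [350.0, 220.0],
    })


def parse_affordance(vlm_response: str) -> PointAffordance:
    """Parse the VLM's JSON answer into a structured affordance."""
    data = json.loads(vlm_response)
    return PointAffordance(
        grasp_point=tuple(data["grasp_point"]),
        target_point=tuple(data["target_point"]),
        waypoint=tuple(data["waypoint"]),
    )


def affordance_to_motion(aff: PointAffordance) -> list[dict]:
    """Convert 2D affordance points into a coarse motion plan.

    A real robot stack would lift these pixels to 3D using a depth camera
    and calibrated extrinsics before commanding the arm.
    """
    return [
        {"action": "move_above", "pixel": aff.grasp_point},
        {"action": "grasp", "pixel": aff.grasp_point},
        {"action": "move_through", "pixel": aff.waypoint},
        {"action": "release", "pixel": aff.target_point},
    ]


if __name__ == "__main__":
    response = query_vlm("scene.png", "put the scissors into the drawer")
    affordance = parse_affordance(response)
    for step in affordance_to_motion(affordance):
        print(step)
```

The key design choice this sketch illustrates is that the VLM never outputs low-level actions; it only marks points in image space, and a separate, simple procedure maps those points to motions.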