The paper proposes the LLM+A framework to empower large language models (LLMs) for language-conditioned robotic manipulation tasks. LLM+A uses pre-trained vision-language models (VLMs) to provide the LLMs with textual observations of the environment and the interactive objects.
The key innovation is the affordance prompting technique, which stimulates the LLMs to assess which interactions with the observed objects are physically feasible and to produce the corresponding affordance information.
With the affordance information, the LLMs can decompose high-level language instructions into feasible sub-tasks and generate low-level control sequences for the robot. Experiments on various robotic manipulation tasks show that LLM+A is effective and robust, outperforming recent LLM-based baselines that rely on pre-defined skills or additional training.
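The overall loop can be pictured as a short pipeline: VLM observation, affordance prompting, sub-task decomposition, and low-level control generation. The following Python sketch illustrates this flow under stated assumptions; the function names (`describe_scene`, `query_llm`, `llm_plus_a_step`) and the prompt wording are illustrative placeholders, not the paper's actual implementation.

```python
# Hypothetical sketch of the LLM+A control loop; all names and prompts are
# illustrative assumptions, not the authors' code.

def describe_scene(image) -> str:
    """Placeholder for a pre-trained VLM that returns a textual observation
    of the environment and the interactive objects."""
    raise NotImplementedError

def query_llm(prompt: str) -> str:
    """Placeholder for a call to a pre-trained LLM."""
    raise NotImplementedError

def llm_plus_a_step(instruction: str, image) -> str:
    # 1. Ground the scene in language with the VLM.
    observation = describe_scene(image)

    # 2. Affordance prompting: ask the LLM which object interactions are
    #    physically feasible before it commits to a plan.
    affordances = query_llm(
        f"Observation: {observation}\n"
        f"Instruction: {instruction}\n"
        "For each object, describe which actions are physically feasible "
        "(its affordances)."
    )

    # 3. Decompose the high-level instruction into feasible sub-tasks.
    sub_tasks = query_llm(
        f"Observation: {observation}\nAffordances: {affordances}\n"
        f"Instruction: {instruction}\n"
        "Decompose the instruction into an ordered list of feasible sub-tasks."
    )

    # 4. Generate a low-level control sequence for the robot.
    controls = query_llm(
        f"Observation: {observation}\nAffordances: {affordances}\n"
        f"Sub-tasks: {sub_tasks}\n"
        "Output a low-level control sequence (e.g., end-effector waypoints) "
        "for the first sub-task."
    )
    return controls
```

No pre-defined skill library or additional training is assumed in this sketch; the LLM is prompted directly, which mirrors the training-free paradigm the paper emphasizes.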
The paper highlights the potential of leveraging the commonsense knowledge and reasoning capabilities of LLMs to address robotics challenges in a training-free paradigm, mitigating the dataset bottleneck. Future work will focus on further improving the framework's efficiency and extending LLM+A to a broader range of robotic tasks.