Buoso, D., Robinson, L., Averta, G., Torr, P., Franzmeyer, T., & De Martini, D. (2024). Select2Plan: Training-Free ICL-Based Planning through VQA and Memory Retrieval. arXiv preprint arXiv:2411.04006.
This paper introduces Select2Plan (S2P), a novel framework for robot planning that leverages pre-trained vision-language models (VLMs) and in-context learning (ICL) to enable robots to navigate in both first-person view (FPV) and third-person view (TPV) scenarios without requiring extensive task-specific training.
S2P formulates the planning problem as a visual question answering (VQA) task, where the VLM is prompted to select the next robot action from a set of visually annotated candidates in the image. The framework utilizes an experiential memory of annotated images and corresponding human-like explanations to provide context through ICL. A sampler retrieves relevant experiences based on the current scene, and a prompt templating engine combines this information with the live image and task instructions to query the VLM. In the FPV setting, an episodic memory provides additional context about the robot's past actions and the environment's layout.
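The retrieve-then-prompt loop described above can be illustrated with a minimal sketch. It is not the authors' implementation: the names `Experience`, `retrieve`, `build_prompt`, and `select_action` are hypothetical, the use of cosine similarity over precomputed scene embeddings is an assumption about how the sampler might rank experiences, and the VLM is represented by any callable that maps a text prompt to a text answer. A real system would also pass the annotated live image to the model.

```python
from dataclasses import dataclass
from math import sqrt
from typing import Callable, Optional, Sequence


@dataclass
class Experience:
    """Hypothetical memory entry: scene embedding, chosen candidate, explanation."""
    embedding: Sequence[float]
    chosen_label: str
    explanation: str


def cosine(a: Sequence[float], b: Sequence[float]) -> float:
    """Cosine similarity; returns 0.0 if either vector has zero norm."""
    dot = sum(x * y for x, y in zip(a, b))
    norm = sqrt(sum(x * x for x in a)) * sqrt(sum(y * y for y in b))
    return dot / norm if norm else 0.0


def retrieve(memory: Sequence[Experience], query: Sequence[float], k: int = 2):
    """Sampler (assumed): return the k experiences most similar to the current scene."""
    ranked = sorted(memory, key=lambda e: cosine(e.embedding, query), reverse=True)
    return ranked[:k]


def build_prompt(task: str, examples: Sequence[Experience],
                 candidates: Sequence[str], history: Sequence[str] = ()) -> str:
    """Prompt templating: combine task, in-context examples, and candidate labels."""
    lines = [f"Task: {task}"]
    for i, ex in enumerate(examples, 1):
        lines.append(f"Example {i}: chose {ex.chosen_label} because {ex.explanation}")
    if history:  # episodic memory, used in the FPV setting
        lines.append("Previous actions: " + ", ".join(history))
    lines.append("Candidates in the current image: " + ", ".join(candidates))
    lines.append("Answer with the label of the best next point.")
    return "\n".join(lines)


def select_action(vlm: Callable[[str], str], prompt: str,
                  candidates: Sequence[str]) -> Optional[str]:
    """Query the VLM and accept the answer only if it names a known candidate."""
    answer = vlm(prompt).strip()
    return answer if answer in candidates else None
```

Rejecting any answer that is not one of the annotated candidates keeps the planner restricted to a selection task, which is the point of framing planning as VQA.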
S2P demonstrates the potential of ICL-based frameworks combined with VLMs for autonomous navigation, achieving performance comparable to that of extensively trained models while using minimal data and no specialized training. The framework's adaptability to diverse contexts and its ability to generalize to novel situations make it a promising approach for real-world robotic applications.
This research contributes to the field of robot navigation by presenting a novel, training-free approach that leverages the power of pre-trained VLMs and ICL. The framework's flexibility and efficiency in utilizing diverse context sources have significant implications for developing scalable and adaptable autonomous systems.
While S2P shows promising results, future research could explore incorporating more sophisticated scene understanding and reasoning capabilities into the framework. Additionally, investigating how different VLM architectures and ICL techniques affect performance could guide further improvements to the system.
Source: Davide Buoso et al., arxiv.org, 11-07-2024, https://arxiv.org/pdf/2411.04006.pdf