Key Concepts
LEGENT is an open, scalable platform for integrating Large Language Models (LLMs) and Large Multimodal Models (LMMs) into embodied agents that can perform complex real-life tasks in physical environments.
Summary
LEGENT is an open-source platform designed to facilitate the development of embodied agents by integrating Large Language Models (LLMs) and Large Multimodal Models (LMMs). The platform consists of two key components:
- 3D Embodied Environment:
  - Provides diverse, realistic, and interactive 3D scenes
  - Features human-like agents with egocentric vision, language interaction, and generalizable actions
  - Offers a user-friendly interface for researchers to easily integrate LLMs and LMMs
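The interface described above can be pictured as a standard observation-action loop. The sketch below is purely illustrative: the class and field names (`Observation`, `Action`, `control_loop`, and so on) are assumptions for this summary, not LEGENT's actual API.

```python
from dataclasses import dataclass
from typing import Callable, List, Optional

# Hypothetical sketch of an embodied agent-environment loop.
# All names here are illustrative stand-ins, not LEGENT's real interface.

@dataclass
class Observation:
    egocentric_image: Optional[bytes]  # frame from the agent's camera
    chat: str                          # latest language message from the user

@dataclass
class Action:
    move_forward: float = 0.0          # meters to move
    rotate: float = 0.0                # degrees to turn
    speak: str = ""                    # language response to the user

def control_loop(policy: Callable[[Observation], Action],
                 observations: List[Observation]) -> List[Action]:
    """Feed each observation to a policy (e.g. an LMM wrapper) and
    collect the actions it emits, mimicking an env step loop."""
    return [policy(obs) for obs in observations]

# A trivial rule-based policy standing in for an LLM/LMM-backed agent.
def demo_policy(obs: Observation) -> Action:
    if "come here" in obs.chat.lower():
        return Action(move_forward=1.0)
    return Action(speak="I am ready.")
```

In the real platform, `demo_policy` would be replaced by a vision-language-action model consuming the egocentric frame and the chat text.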
- Scalable Data Generation Pipeline:
  - Employs procedural and language-guided scene generation techniques to create diverse environments
  - Utilizes LLMs and controllers to generate optimal agent trajectories and actions for large-scale training data
  - Supports the creation of various embodied tasks, including navigation and question answering
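The pipeline idea above (sample a scene, let a controller plan an optimal trajectory, pair it with a language instruction) can be sketched as follows. This is a minimal toy version under assumed names; the grid world, greedy planner, and "Come Here" instruction are simplifications for illustration, not LEGENT's actual implementation.

```python
import random

# Toy sketch of trajectory data generation: a procedural scene sampler
# plus a planner yield (instruction, action sequence) training pairs.
# All function names are hypothetical.

def sample_scene(rng: random.Random, size: int = 5) -> dict:
    """Procedurally place an agent and a target on a small grid."""
    agent = (rng.randrange(size), rng.randrange(size))
    target = (rng.randrange(size), rng.randrange(size))
    return {"agent": agent, "target": target}

def plan_actions(scene: dict) -> list:
    """Stand-in 'controller': greedy Manhattan path to the target."""
    (ax, ay), (tx, ty) = scene["agent"], scene["target"]
    actions = []
    actions += ["move_east" if tx > ax else "move_west"] * abs(tx - ax)
    actions += ["move_north" if ty > ay else "move_south"] * abs(ty - ay)
    return actions

def generate_dataset(n: int, seed: int = 0) -> list:
    """Produce n (instruction, trajectory) pairs, as in a 'Come Here' task."""
    rng = random.Random(seed)
    return [{"instruction": "Come here.",
             "actions": plan_actions(sample_scene(rng))}
            for _ in range(n)]
```

Scaling `n` is what lets the pipeline cheaply expand training data, which the paper's 1k-to-10k trajectory comparison exploits.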
The authors demonstrate the potential of LEGENT by training a prototype vision-language-action model using the generated data. The model outperforms GPT-4V, a mainstream LMM, in embodied tasks, showcasing the platform's ability to facilitate the integration of LLMs and LMMs for developing generalizable embodied intelligence.
LEGENT aims to bridge the gap between embodied AI and the extensive knowledge present in LLMs and LMMs, enabling the research community to make progress in this field. The platform is publicly available, and the authors plan to continuously enhance it, including expanding the data generation pipeline, scaling model training, and improving scene generation and agent animation.
Statistics
The prototype model trained on LEGENT-generated data outperforms GPT-4V, a mainstream LMM, in embodied tasks.
Increasing the training data from 1k to 10k trajectories improves the model's performance on the "Come Here" and "Where Is" tasks.
The model's ability to generalize to the untrained "Where Is" task in a two-room setting demonstrates the platform's potential for developing generalizable embodied intelligence.
Quotes
"Despite advancements in Large Language Models (LLMs) and Large Multimodal Models (LMMs), their integration into language-grounded, human-like embodied agents remains incomplete, hindering complex real-life task performance in physical environments."
"To achieve generalizable embodied intelligence, two key factors are crucial: language grounding to utilize the extensive knowledge in LMMs, and the expansion of training data for embodied AI."