Core Concepts
The SIMA project aims to develop an instructable agent that can accomplish a wide range of tasks in any simulated 3D environment by following open-ended natural language instructions.
Abstract
The SIMA (Scalable, Instructable, Multiworld Agent) project focuses on training embodied AI agents that can follow arbitrary language instructions to act in a diverse range of virtual 3D environments, including both curated research environments and open-ended commercial video games.
The key aspects of the SIMA approach include:
Using a diverse portfolio of more than ten simulated 3D environments, spanning research environments (e.g., Playhouse, WorldLab, Construction Lab) and commercial video games (e.g., Goat Simulator 3, No Man's Sky, Valheim).
Training agents to use a generic, human-like interface of pixel observations and keyboard-and-mouse actions, rather than specialized APIs or action spaces.
Focusing on following open-ended natural language instructions, rather than just maximizing game scores or generating plausible behavior.
Leveraging a combination of pretrained models (e.g., for image-text alignment, video prediction) and from-scratch training to enable efficient learning.
Developing a range of evaluation methods, including ground-truth assessments in research environments, optical character recognition in commercial games, and human evaluation, to measure agent performance across diverse skills.
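The generic interface named in the list above (pixel observations and a language instruction in, keyboard-and-mouse actions out) can be illustrated with a minimal sketch. Every name here (`Observation`, `Action`, `Agent`, `KeywordAgent`, `run_episode`) is hypothetical and chosen only for illustration; none is taken from the SIMA codebase, and the keyword-matching policy is a toy stand-in for a learned model, not the project's method.

```python
from dataclasses import dataclass
from typing import FrozenSet, List, Protocol, Tuple

# Hypothetical types: a frame is a grid of RGB pixels, as a human would see on screen.
Frame = List[List[Tuple[int, int, int]]]


@dataclass(frozen=True)
class Observation:
    frame: Frame        # image observation
    instruction: str    # open-ended natural language instruction


@dataclass(frozen=True)
class Action:
    keys: FrozenSet[str] = frozenset()           # keys held during this step
    mouse_dx: int = 0                            # horizontal mouse movement
    mouse_dy: int = 0                            # vertical mouse movement
    mouse_buttons: FrozenSet[str] = frozenset()  # e.g. {"left"}


class Agent(Protocol):
    """Anything that maps an observation to a keyboard-and-mouse action."""

    def act(self, obs: Observation) -> Action: ...


class KeywordAgent:
    """Toy policy (an assumption for this sketch): match a keyword to a key."""

    KEYMAP = {"forward": "w", "back": "s", "left": "a", "right": "d"}

    def act(self, obs: Observation) -> Action:
        text = obs.instruction.lower()
        for word, key in self.KEYMAP.items():
            if word in text:
                return Action(keys=frozenset({key}))
        return Action()  # no-op when nothing matches


def blank_frame(width: int = 4, height: int = 3) -> Frame:
    """Placeholder image; a real environment would supply rendered pixels."""
    return [[(0, 0, 0) for _ in range(width)] for _ in range(height)]


def run_episode(agent: Agent, instruction: str, steps: int = 3) -> List[Action]:
    """Feed the same instruction with a fresh frame each step; collect actions."""
    actions = []
    for _ in range(steps):
        obs = Observation(frame=blank_frame(), instruction=instruction)
        actions.append(agent.act(obs))
    return actions
```

Because the agent only sees frames and text and only emits key and mouse events, the same `Agent` protocol could in principle sit in front of any environment that renders to a screen, which is the motivation the summary gives for avoiding environment-specific APIs.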
The results show that the SIMA agent can perform a variety of language-instructed tasks across multiple environments, with performance varying by the complexity of the environment and by the skill each task requires. The project highlights the challenges and opportunities in bridging the gap between language and grounded embodied behavior at scale.
Stats
"Building embodied AI systems that can follow arbitrary language instructions in any 3D environment is a key challenge for creating general AI."
"Our agents interact with environments in real-time using a generic, human-like interface: the inputs are image observations and language instructions and the outputs are keyboard-and-mouse actions."
"We have a portfolio of over ten 3D environments, consisting of research environments and commercial video games."
Quotes
"Language is most useful in the abstractions it conveys about the world. Language abstractions can enable efficient learning and generalization."
"Bridging the divide between the symbols of language and their external referents is a core challenge for developing general embodied AI."
"Drawing inspiration from the lesson of large language models that training on a broad distribution of data is the most effective way to make progress in general AI."