The paper presents an Embodied Multi-Modal Agent (EMMA) that takes a textual task instruction and pixel observations of a visual environment and generates a sequence of actions to complete the task efficiently.
The key insights are:
EMMA is built on a modularized Vision-Language Model (VLM) architecture that connects a pretrained Vision Transformer to a pretrained language model decoder. This lets EMMA leverage existing powerful VLMs in a flexible and computationally efficient way.
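As a rough illustration, the sketch below wires together a frozen vision backbone, a trainable projection layer, and a language decoder. The module choices, dimensions, and the smoke test are placeholders chosen for readability, not the authors' exact architecture.

```python
# Minimal sketch of a modularized VLM agent: a frozen vision backbone supplies
# patch tokens, a linear projection bridges them into the language model's
# embedding space, and a decoder emits logits over action tokens.
import torch
import torch.nn as nn

class ModularVLMAgent(nn.Module):
    def __init__(self, vis_dim=768, lm_dim=512, vocab=1000):
        super().__init__()
        # Stand-ins for the pretrained components (swap in a real ViT / LM decoder).
        self.vit = nn.Linear(vis_dim, vis_dim)       # placeholder "ViT" over patch features
        self.proj = nn.Linear(vis_dim, lm_dim)       # newly trained bridge: vision -> LM space
        self.embed = nn.Embedding(vocab, lm_dim)     # LM token embeddings
        layer = nn.TransformerDecoderLayer(d_model=lm_dim, nhead=8, batch_first=True)
        self.decoder = nn.TransformerDecoder(layer, num_layers=2)  # placeholder LM decoder
        self.head = nn.Linear(lm_dim, vocab)
        for p in self.vit.parameters():              # keep the vision backbone frozen
            p.requires_grad = False

    def forward(self, patch_feats, instruction_ids, action_ids):
        # patch_feats: (B, N, vis_dim) visual patch features of the pixel observation
        # instruction_ids / action_ids: token ids of the task text / target action
        vis = self.proj(self.vit(patch_feats))                     # (B, N, lm_dim)
        memory = torch.cat([vis, self.embed(instruction_ids)], 1)  # multimodal context
        tgt = self.embed(action_ids)
        return self.head(self.decoder(tgt, memory))                # logits over action tokens

# Smoke test with random tensors.
agent = ModularVLMAgent()
logits = agent(torch.randn(2, 196, 768),
               torch.randint(0, 1000, (2, 16)),
               torch.randint(0, 1000, (2, 8)))
print(logits.shape)  # torch.Size([2, 8, 1000])
```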
To overcome the challenges of training EMMA in a complex visual environment, such as sparse rewards and distribution shift, the authors leverage an LLM expert operating in a parallel TextWorld. The LLM expert provides EMMA with step-by-step guidance through cross-modality interactive imitation learning.
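This training scheme can be pictured as a DAgger-style loop in which the LLM expert relabels every state the VLM agent visits. The environment and expert interfaces below (`visual_env`, `text_env`, `llm_expert_action`) are assumptions made for the sketch, not APIs from the paper.

```python
# Hedged sketch of cross-modality interactive imitation learning (DAgger-style):
# the agent acts on pixels, the expert acts on the aligned text state, and the
# agent is repeatedly cloned on the aggregated expert-labeled dataset.
import random

def interactive_imitation(vlm_agent, visual_env, text_env, llm_expert_action,
                          n_iters=10, episodes_per_iter=5, beta0=0.9):
    dataset = []                                   # aggregated (pixels, task, expert_action)
    for it in range(n_iters):
        beta = beta0 ** it                         # mixing coeff: expert -> agent rollouts
        for _ in range(episodes_per_iter):
            pixels, task = visual_env.reset()
            text_state, _ = text_env.reset()       # parallel TextWorld mirrors the scene
            done = False
            while not done:
                expert_action = llm_expert_action(text_state, task)
                dataset.append((pixels, task, expert_action))   # relabel visited states
                # Execute either the expert's or the agent's action so that states
                # are collected from the learner's own distribution (the DAgger trick).
                action = expert_action if random.random() < beta \
                         else vlm_agent.act(pixels, task)
                pixels, done = visual_env.step(action)
                text_state, _ = text_env.step(action)
        vlm_agent.fit(dataset)                     # behaviour cloning on aggregated data
    return vlm_agent
```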
The LLM expert is composed of an actor that generates actions and a critic that provides retrospective feedback on EMMA's historical trajectories. This retrospective process enables the LLM expert to progressively improve its performance and provide better teaching signals to EMMA.
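A rough sketch of such an actor-critic LLM expert is shown below. It assumes only a generic `llm(prompt) -> str` completion function; the prompt wording and the feedback-memory format are illustrative rather than taken from the paper.

```python
# Sketch of a retrospective LLM expert: the actor proposes text actions, the
# critic reflects on failed trajectories, and the reflections are fed back into
# future actor prompts so the expert's guidance improves over trials.
class RetrospectiveLLMExpert:
    def __init__(self, llm):
        self.llm = llm
        self.feedback_memory = []                  # reflections accumulated across trials

    def act(self, task, text_state, history):
        """Actor: propose the next text action given the task and current state."""
        prompt = (
            f"Task: {task}\n"
            "Lessons from past failures:\n" + "\n".join(self.feedback_memory) + "\n"
            f"Trajectory so far:\n{history}\n"
            f"Current observation: {text_state}\n"
            "Next action:"
        )
        return self.llm(prompt).strip()

    def reflect(self, task, failed_trajectory):
        """Critic: look back at a failed trajectory and distill what to do differently."""
        prompt = (
            f"Task: {task}\n"
            f"The following trajectory failed:\n{failed_trajectory}\n"
            "In one or two sentences, explain the likely mistake and how to avoid it:"
        )
        self.feedback_memory.append(self.llm(prompt).strip())
```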
Extensive evaluations on the ALFWorld benchmark show that EMMA substantially outperforms state-of-the-art VLM-based agents in visual environments, achieving 20%-70% higher success rates. EMMA also exhibits strong robustness to noisy observations compared to LLM-based agents.
Furthermore, EMMA demonstrates powerful generalization to open-vocabulary and free-form task instructions, highlighting its potential in real-world scenarios.
Key insights distilled from the paper by Yijun Yang et al. (arxiv.org, 04-01-2024): https://arxiv.org/pdf/2311.16714.pdf