Core Concepts
Multimodal variational autoencoders (VAEs) can integrate visual, language, and action modalities into a shared latent representation, enabling unsupervised learning of robotic manipulation tasks.
Abstract
The paper explores the potential of state-of-the-art multimodal VAE models for robotic manipulation tasks, where actions are learned from a combination of motion demonstrations, images, and natural language instructions.
Key highlights:
The authors adapt three state-of-the-art multimodal VAE models (MVAE, MMVAE, and MoPoE), modifying their encoder-decoder architectures to map between natural language instructions, images, and whole motion trajectories (the fusion step that distinguishes these models is sketched after this list).
They propose a model-independent adjustment of the training objective using the σ-VAE loss, which improves the performance of the implemented models by up to 55% compared to the standard mean squared error loss (a minimal sketch of this loss also follows the list).
The models are trained and evaluated on 34 synthetic robotic datasets with varying complexity in terms of the number of tasks, distractors, position variability, and task length.
The MVAE model is the most robust, outperforming MMVAE and MoPoE across most scenarios. However, all models struggle to map pixel-level information to precise Cartesian positions, especially in the presence of distractors.
The authors also find that task length has a more significant impact on performance than position variability, suggesting the need for modular approaches that can handle long-horizon tasks.
Overall, the paper provides valuable insights into the capabilities and limitations of SOTA multimodal VAEs for unsupervised learning of robotic manipulation from vision, language, and action.
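Of the three models, MVAE fuses the modality-specific posteriors with a product of experts (PoE), MMVAE with a mixture of experts, and MoPoE with a mixture over products of modality subsets. Below is a minimal sketch of the PoE fusion step only; the function name and tensor shapes are illustrative, not the authors' code.

```python
import torch

def product_of_experts(mus, logvars):
    """Fuse unimodal Gaussian posteriors q(z|x_i) = N(mu_i, var_i) into a
    joint posterior via a precision-weighted product, including the
    standard-normal 'prior expert' used by MVAE (Wu & Goodman, 2018)."""
    prior_mu = torch.zeros_like(mus[0])
    prior_logvar = torch.zeros_like(logvars[0])
    mu = torch.stack([prior_mu] + list(mus))            # (experts, batch, dim)
    logvar = torch.stack([prior_logvar] + list(logvars))

    precision = torch.exp(-logvar)                      # 1 / var_i per expert
    joint_var = 1.0 / precision.sum(dim=0)              # precisions add up
    joint_mu = joint_var * (mu * precision).sum(dim=0)  # precision-weighted mean
    return joint_mu, torch.log(joint_var)
```

A practical benefit of PoE-style fusion is that absent modalities can simply be dropped from the product, so the same model can, for example, infer a trajectory from an image and a language instruction alone.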
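The σ-VAE loss replaces the fixed-variance Gaussian decoder that plain MSE implicitly assumes with one whose output variance is learned, which calibrates the reconstruction term against the KL term. The following is a minimal sketch of the learned-variance variant with a single shared scalar σ; the paper may use a different variant, and the class name is illustrative.

```python
import torch
import torch.nn as nn

class SigmaVAERecon(nn.Module):
    """Reconstruction term as a Gaussian negative log-likelihood with one
    learned log-sigma shared across all output dimensions, following the
    sigma-VAE idea (Rybkin et al., 2021)."""

    def __init__(self):
        super().__init__()
        self.log_sigma = nn.Parameter(torch.zeros(()))  # learned scalar

    def forward(self, recon, target):
        # -log N(target; recon, sigma^2) per element, up to a constant:
        #   0.5 * (target - recon)^2 / sigma^2 + log(sigma)
        nll = (0.5 * (target - recon) ** 2 * torch.exp(-2 * self.log_sigma)
               + self.log_sigma)
        # Sum over output dimensions, average over the batch.
        return nll.flatten(start_dim=1).sum(dim=1).mean()
```

Because σ is optimized jointly with the network, the effective weight of the reconstruction loss adapts to the data scale instead of being hand-tuned, which is consistent with the reported gains over plain MSE.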
Stats
The robot is controlled through the x, y, and z end-effector positions plus an additional binary value g for gripper opening and closing.
The number of timesteps per motion trajectory varies from 13 to 68 across the different tasks.
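To make the action modality concrete, each trajectory can be stored as a (T, 4) array of (x, y, z, g) steps. Since T varies from 13 to 68, trajectories would typically be padded to a common length before batching; the scheme below (repeating the final pose) is an assumption for illustration, not taken from the paper.

```python
import numpy as np

MAX_T = 68  # longest trajectory across the datasets

def pad_trajectory(traj: np.ndarray, max_t: int = MAX_T) -> np.ndarray:
    """Pad a (T, 4) trajectory of (x, y, z, g) steps to (max_t, 4) by
    repeating the final pose, so the gripper state stays constant after
    the task ends (an assumed padding scheme)."""
    assert traj.ndim == 2 and traj.shape[1] == 4 and traj.shape[0] <= max_t
    pad = np.repeat(traj[-1:], max_t - traj.shape[0], axis=0)
    return np.concatenate([traj, pad], axis=0)

# A 13-step trajectory (the shortest case) padded to 68 steps:
print(pad_trajectory(np.zeros((13, 4))).shape)  # (68, 4)
```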
Quotes
"Our primary objective is to understand the challenges and requirements for leveraging VAEs in such a multimodal setting where each modality has a different level of abstraction and complexity (e.g., high-level language instruction versus low-level end-effector trajectory) and there is no additional supervision such as ground truth of the object positions."
"We train the models on 34 synthetic robotic datasets with variable complexity in terms of the number of tasks, distractors, position variability and task length. We then evaluate which aspects of the task are the most challenging for the multimodal VAEs."