Core Concepts
This research paper investigates multimodal integration in variational autoencoders (VAEs) for robot control. Using information-theoretic measures, it analyzes the importance of the different sensory modalities and the impact of KL-cost weighting schedules on model performance.
Stats
The latent space of the multimodal VAE has the same dimensionality as the input data: both are 28-dimensional.
The input data consists of two time steps for each of the five modalities: joint position, vision, touch, sound, and motor commands.
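A minimal sketch of such a model, assuming fully connected encoder/decoder layers and a hidden size of 128 (the paper's exact architecture is not specified here):

```python
import torch
import torch.nn as nn

class MultimodalVAE(nn.Module):
    """VAE whose latent space matches the 28-dim input
    (two time steps of five concatenated modalities).
    Hidden size and layer count are illustrative assumptions."""

    def __init__(self, dim=28, hidden=128):
        super().__init__()
        self.encoder = nn.Sequential(nn.Linear(dim, hidden), nn.ReLU())
        self.mu = nn.Linear(hidden, dim)        # latent dim == input dim
        self.log_var = nn.Linear(hidden, dim)
        self.decoder = nn.Sequential(
            nn.Linear(dim, hidden), nn.ReLU(), nn.Linear(hidden, dim))

    def forward(self, x):
        h = self.encoder(x)
        mu, log_var = self.mu(h), self.log_var(h)
        # Reparameterization trick: z = mu + sigma * eps
        z = mu + torch.randn_like(mu) * torch.exp(0.5 * log_var)
        return self.decoder(z), mu, log_var
```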
Training uses three types of augmented data points in which parts of the input are "muted" by setting their entries to a fixed value outside the input range.
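A minimal sketch of this muting augmentation, assuming the 28-dimensional input concatenates two 14-dimensional time steps; the per-modality slice boundaries, input range, and mute value below are illustrative assumptions rather than the paper's values:

```python
import numpy as np

# Hypothetical layout of one 14-dim time step; the paper's
# per-modality dimensions are not given here, so these slices
# are illustrative only.
MODALITY_SLICES = {
    "joint_position": slice(0, 4),
    "vision":         slice(4, 8),
    "touch":          slice(8, 10),
    "sound":          slice(10, 12),
    "motor":          slice(12, 14),
}
STEP_DIM = 14       # dimensions per time step (two steps -> 28 total)
MUTE_VALUE = -2.0   # fixed value outside the assumed [-1, 1] input range

def mute_modality(x, modality):
    """Return a copy of a 28-dim sample with one modality muted
    in both time steps."""
    x = x.copy()
    sl = MODALITY_SLICES[modality]
    for offset in (0, STEP_DIM):  # apply to both time steps
        x[offset + sl.start : offset + sl.stop] = MUTE_VALUE
    return x

# Example: augment a batch by muting the vision channel.
batch = np.random.uniform(-1.0, 1.0, size=(32, 28))
muted = np.stack([mute_modality(row, "vision") for row in batch])
```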
Four different KL-weighting schedules are compared: constant 1, constant 0, dynamic decrease with plateau 0, and dynamic decrease with plateau 1.
For each schedule, 20 models are trained for 80,000 epochs.
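The four schedules can be read as functions mapping the training epoch to a KL weight. A hedged sketch: the starting weight, the linear ramp shape, and the fraction of training spent decreasing are illustrative assumptions, not values taken from the paper:

```python
def kl_weight(epoch, schedule, total_epochs=80_000,
              start=10.0, plateau_frac=0.5):
    """KL-cost weight at `epoch` under one of the four schedules.
    `start`, the linear ramp, and `plateau_frac` are assumptions."""
    if schedule == "constant_1":
        return 1.0
    if schedule == "constant_0":
        return 0.0
    # Dynamic schedules: decrease linearly from `start`, then hold
    # at the plateau value (0 or 1) for the rest of training.
    plateau = {"dynamic_plateau_0": 0.0, "dynamic_plateau_1": 1.0}[schedule]
    ramp_end = plateau_frac * total_epochs
    if epoch >= ramp_end:
        return plateau
    return start + (epoch / ramp_end) * (plateau - start)
```

During training the weight scales the KL term of the ELBO, e.g. loss = reconstruction_loss + kl_weight(epoch, schedule) * kl_divergence.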
Quotes
"Human perception is inherently multimodal. We integrate, for instance, visual, proprioceptive and tactile information into one experience."
"In this work we define four different measures with which this integration can be analyzed in detail. This could be used to identify the primary sense of model, detect whether one or multiple modalities are informative for a certain task or guide the learning of a robot to resemble the phases of human development."
"This work is a first approach to analyzing the multimodal integration in a robot. In future research one might use these measures in order to guide the learning and exploration of a robot."