Core Concept
A neural model that bidirectionally binds robot actions and their language descriptions using paired variational autoencoders and a pretrained language model (BERT) to enable understanding of unconstrained natural language instructions.
Summary
The paper presents a neural model that bidirectionally maps robot actions and their language descriptions. The model consists of two main components:
- Paired Variational Autoencoders (PVAE):
- The PVAE model has two variational autoencoders - one for language and one for actions.
- The language VAE reconstructs descriptions, while the action VAE reconstructs joint angle values conditioned on visual features.
- The two VAEs are implicitly bound together through a binding loss, enabling bidirectional translation between actions and language.
- The PVAE model can map each robot action to multiple description alternatives, transcending the strict one-to-one mapping.
- Experiments show the superiority of the PVAE over standard autoencoders and the advantage of using channel-separated visual feature extraction.
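The binding mechanism described above can be illustrated with a minimal numpy sketch. This is not the authors' implementation: the toy linear encoders, dimensions, and weights are hypothetical, and the reconstruction terms are omitted. It only shows how a binding term pulls the language and action latent means together alongside the usual VAE KL terms.

```python
import numpy as np

rng = np.random.default_rng(0)

def encode(x, W):
    """Toy encoder: linear map to a latent mean and log-variance."""
    h = np.tanh(W @ x)
    d = h.shape[0] // 2
    return h[:d], h[d:]  # mean, log-variance

def kl_div(mu, logvar):
    """KL(q(z|x) || N(0, I)) for a diagonal Gaussian posterior."""
    return 0.5 * np.sum(np.exp(logvar) + mu**2 - 1.0 - logvar)

def binding_loss(mu_lang, mu_act):
    """Pull the two latent means together (squared Euclidean distance)."""
    return np.sum((mu_lang - mu_act) ** 2)

# Illustrative inputs: a language feature vector and an action feature vector.
x_lang = rng.normal(size=16)
x_act = rng.normal(size=24)
W_lang = rng.normal(size=(8, 16)) * 0.1
W_act = rng.normal(size=(8, 24)) * 0.1

mu_l, lv_l = encode(x_lang, W_lang)
mu_a, lv_a = encode(x_act, W_act)

# Total loss = reconstruction terms (omitted here) + KL terms + binding term.
loss = kl_div(mu_l, lv_l) + kl_div(mu_a, lv_a) + binding_loss(mu_l, mu_a)
```

Minimising the binding term drives the two encoders toward a shared latent space, which is what enables decoding an action's latent code with the language decoder (and vice versa) at translation time.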
- PVAE-BERT:
- To enable the model to understand unconstrained natural language instructions, the authors replace the LSTM language encoder with a pretrained BERT model.
- PVAE-BERT can recognize different commands that correspond to the same actions, handling variations in word order, filler words, etc.
- Experiments demonstrate that PVAE-BERT achieves comparable performance to the original PVAE in action-to-language translation, while also handling a wider range of language input.
- Principal component analysis on the hidden representations shows that the model learns the compositionality of language and the semantic similarity between actions and their descriptions.
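The PCA analysis on hidden representations can be sketched as follows. The latent vectors here are synthetic stand-ins (the paper would use encoder hidden states), and the two-cluster setup is purely illustrative of what "commands with the same action word group together" would look like in principal-component space.

```python
import numpy as np

def pca(X, k=2):
    """Project rows of X onto the top-k principal components."""
    Xc = X - X.mean(axis=0)
    # SVD of the centred data; rows of Vt are the principal directions,
    # ordered by decreasing singular value.
    U, S, Vt = np.linalg.svd(Xc, full_matrices=False)
    return Xc @ Vt[:k].T

# Hypothetical latent vectors for commands of two action types; in the paper
# these would be model hidden states, here they are synthetic 8-D points.
rng = np.random.default_rng(1)
push = rng.normal(loc=0.0, size=(10, 8))  # e.g. "push ..." commands
pull = rng.normal(loc=3.0, size=(10, 8))  # e.g. "pull ..." commands
Z = np.vstack([push, pull])

proj = pca(Z, k=2)
print(proj.shape)  # (20, 2)
```

If the model has learned compositional structure, projections of commands sharing an action (or colour, or speed) word form visible clusters along the leading components, which is the kind of pattern the paper's PCA plots are used to reveal.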
The proposed approach combines the strengths of variational autoencoders and pretrained language models to enable robust and flexible robotic language learning.
Key Statistics
"The dataset that the model is trained on consists of pairs of simple robot actions and their textual descriptions, e.g., 'pushing away the blue cube'."
"Every sentence is composed of three words (excluding the <BOS/EOS> tags which indicate the beginning or end of the sentence) with the first word indicating the action, the second the cube colour and the last the speed at which the action is performed (e.g., 'push green slowly')."
Quotations
"Human infants learn language in their environment while their caregivers describe the properties of objects, which they interact with, and actions, which are performed on those objects. In a similar vein, artificial agents can be taught language; different modalities such as audio, touch, proprioception and vision can be employed towards learning language in the environment."
"To overcome this, we equip the PVAE architecture with the Bidirectional Encoder Representations from Transformers (BERT) language model [6] that has been pretrained on large-scale text corpora to enable the recognition of unconstrained natural language commands by human users."