Core Concepts
This paper presents three proposals to build the SLIVAR (Spoken Language Interaction with Virtual Agents and Robots) community: (1) creating educational resources, (2) establishing benchmarks and challenges, and (3) integrating large language models effectively with robots while meeting requirements for natural interaction.
Abstract
The paper chronicles the recent history of the growing field of spoken dialogue with robots and offers three proposals to advance the SLIVAR community.
Proposal 1: Educational Resources
Robotics, natural language processing (NLP), spoken dialogue systems (SDS), and human-robot interaction (HRI) are separate fields, each requiring substantial educational preparation.
The authors propose creating a central resource to share syllabi, course content, and other educational materials to help train students in this interdisciplinary area.
A sample curriculum covering math, computer science, robotics, data science, AI, and HRI is provided.
The authors have started a GitHub repository to host these educational resources and propose using a forum for discussion.
Proposal 2: Benchmarks & Challenges
Benchmarks help drive research progress by providing a common way to compare work.
The authors outline key requirements for a benchmark on dialogue with robots, including being multimodal, co-located, high-stakes, user-centered, and community-agnostic.
Existing benchmarks such as ALFRED, TEACh, and Alexa Arena are discussed, but none fully meets the proposed requirements.
The authors propose developing new benchmark infrastructure with both virtual and real-world versions, starting with an initial challenge for a cohort of 5 research teams.
Proposal 3: Large Language Models and Robots
Large language models (LLMs) have become prominent in NLP, but have limitations when it comes to grounding language in the physical world for robots.
Challenges include the closed nature of some LLMs, their large size requiring significant compute resources, and the need for less data-hungry models.
The authors encourage research on smaller, more effective multimodal LLMs that can mitigate biases and support diverse user populations when integrated with robots.
Promising directions discussed include incorporating physical-world representations, incremental processing, and reinforcement learning from human feedback (RLHF).
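To make the idea of incremental processing concrete, the sketch below shows a toy word-by-word reference resolver: instead of waiting for the full utterance, the robot updates its best guess about the referent after every recognized word, so it could begin acting before the user finishes speaking. This is a minimal illustration only; the scene model, class names, and scoring scheme are all hypothetical and not from the paper.

```python
# Minimal sketch of incremental reference resolution for a robot.
# All names here are hypothetical illustrations, not from the paper.
from dataclasses import dataclass, field


@dataclass
class IncrementalGrounder:
    # Hypothetical scene model: object name -> descriptive words.
    scene: dict
    scores: dict = field(default_factory=dict)

    def add_word(self, word):
        """Update referent scores with one new word; return best guess so far."""
        for obj, features in self.scene.items():
            if word in features:
                self.scores[obj] = self.scores.get(obj, 0) + 1
        if not self.scores:
            return None  # no evidence yet
        return max(self.scores, key=self.scores.get)


scene = {
    "red_mug": {"red", "mug", "cup"},
    "blue_book": {"blue", "book"},
}
grounder = IncrementalGrounder(scene)
for word in "pick up the red mug".split():
    guess = grounder.add_word(word)  # best hypothesis updates per word

print(guess)  # → red_mug
```

In a real system the per-word updates would come from an incremental speech recognizer and a learned grounding model, but the control flow (update the interpretation on every partial input, act on the current best hypothesis) is the defining feature of incremental processing.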