Gajewski, P., Gonzalez, A. G. C., & Indurkhya, B. (2024). Context-Aware Command Understanding for Tabletop Scenarios. arXiv preprint arXiv:2410.06355.
This paper presents a novel hybrid algorithm designed to interpret natural human commands in tabletop scenarios, enabling robots to understand and execute instructions given through speech and gestures.
The algorithm integrates a procedural control flow with multiple state-of-the-art deep learning models. The procedural component manages data flow and decision-making, while the deep learning models handle individual sub-tasks: speech recognition (Whisper), text-based reasoning (Phi-3), object detection (GroundingDINO), pointing gesture understanding (MediaPipe), depth estimation (DINOv2), and object segmentation (Segment Anything).
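The division of labour described above, with procedural code deciding what to call and learned models doing the perception, can be illustrated with a minimal Python sketch. This is not the authors' implementation: the function `interpret_command`, the `Detection` and `Instruction` types, and the four callables are hypothetical names introduced here, and the tie-breaking rule (pick the candidate closest to the pointed-at pixel) is an assumption. In a real system the callables would wrap models such as Whisper, Phi-3, GroundingDINO, and MediaPipe.

```python
from dataclasses import dataclass
from typing import Callable, List, Optional, Tuple

# Bounding box as (x_min, y_min, x_max, y_max) in pixels.
Box = Tuple[float, float, float, float]


@dataclass
class Detection:
    label: str
    box: Box


@dataclass
class Instruction:
    action: str
    target: Optional[Detection]  # None when no matching object was found


def interpret_command(
    audio: object,
    image: object,
    transcribe: Callable[[object], str],                  # hypothetical speech-to-text wrapper
    parse_command: Callable[[str], Tuple[str, str]],      # hypothetical: text -> (action, object phrase)
    detect_objects: Callable[[object, str], List[Detection]],  # hypothetical open-vocabulary detector
    locate_pointing: Callable[[object], Optional[Tuple[float, float]]],  # hypothetical: pointed-at pixel
) -> Instruction:
    """Procedural control flow; every model is injected as a plain callable."""
    text = transcribe(audio)
    action, object_phrase = parse_command(text)
    candidates = detect_objects(image, object_phrase)

    if not candidates:
        return Instruction(action, None)
    if len(candidates) == 1:
        return Instruction(action, candidates[0])

    # Several objects match the phrase: use the gesture to disambiguate.
    point = locate_pointing(image)
    if point is None:
        # Assumption: with no gesture available, fall back to the first candidate.
        return Instruction(action, candidates[0])

    def squared_distance(det: Detection) -> float:
        # Squared distance from the box centre to the pointed-at pixel.
        cx = (det.box[0] + det.box[2]) / 2.0
        cy = (det.box[1] + det.box[3]) / 2.0
        return (cx - point[0]) ** 2 + (cy - point[1]) ** 2

    return Instruction(action, min(candidates, key=squared_distance))
```

Passing the models in as callables keeps the control flow testable with simple stubs and makes it straightforward to swap one model for another without touching the decision logic.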
The algorithm effectively extracts actionable instructions for a robot by identifying relevant objects and actions from multimodal input. It operates in a zero-shot fashion, eliminating the need for predefined object models and enabling flexibility across various environments. Evaluation using a newly released dataset of real-world human-robot interactions demonstrates robust performance across different tasks.
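One way to picture how a gesture can single out an object once depth is available is a pointing-ray test in 3D. The sketch below is an illustrative assumption and not the method described in the paper: the function names, the wrist-to-fingertip ray, the metric units, and the `max_distance` threshold are all hypothetical. It selects the object whose centroid lies closest to the ray extending from the fingertip.

```python
import math
from typing import Dict, Optional, Tuple

Vec3 = Tuple[float, float, float]


def _sub(a: Vec3, b: Vec3) -> Vec3:
    return (a[0] - b[0], a[1] - b[1], a[2] - b[2])


def _dot(a: Vec3, b: Vec3) -> float:
    return a[0] * b[0] + a[1] * b[1] + a[2] * b[2]


def distance_to_ray(origin: Vec3, direction: Vec3, point: Vec3) -> float:
    """Shortest distance from `point` to the ray starting at `origin`."""
    norm = math.sqrt(_dot(direction, direction))
    if norm == 0.0:
        raise ValueError("direction must be non-zero")
    unit = (direction[0] / norm, direction[1] / norm, direction[2] / norm)
    # Project onto the ray; clamp at 0 so points behind the hand
    # are measured from the ray origin rather than the full line.
    t = max(0.0, _dot(_sub(point, origin), unit))
    closest = (origin[0] + t * unit[0], origin[1] + t * unit[1], origin[2] + t * unit[2])
    return math.dist(point, closest)


def pick_pointed_object(
    wrist: Vec3,
    fingertip: Vec3,
    centroids: Dict[str, Vec3],
    max_distance: float = 0.15,  # hypothetical threshold, assumed to be in metres
) -> Optional[str]:
    """Return the name of the object nearest the pointing ray, or None."""
    direction = _sub(fingertip, wrist)
    best = min(
        centroids,
        key=lambda name: distance_to_ray(fingertip, direction, centroids[name]),
        default=None,
    )
    if best is None:
        return None
    if distance_to_ray(fingertip, direction, centroids[best]) > max_distance:
        return None  # nothing lies close enough to the ray
    return best
```

Because nothing here depends on object categories, the same test applies to any detected object, which fits the zero-shot setting; note that monocular depth estimates may be relative rather than metric, in which case the threshold would need rescaling.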
The proposed hybrid algorithm successfully combines language processing with visual grounding for robust command interpretation in tabletop scenarios. The integration of multiple deep learning models proves effective, though their limitations, particularly in hand detection and speech recognition, highlight areas for future improvement.
This research contributes to the field of Human-Robot Interaction by presenting a novel approach to multimodal command understanding, particularly relevant for tabletop scenarios. The released dataset further benefits the research community by providing valuable data for future studies.
The algorithm's performance is currently limited by the accuracy of the underlying deep learning models, particularly when they face motion blur or the speech of non-native speakers. Future work will focus on improving robustness against recognition errors, handling continuous data streams, and integrating the algorithm into a larger robotic system for real-world task execution.
Key insights distilled from the paper by Paul Gajewsk... at arxiv.org, 10-10-2024
https://arxiv.org/pdf/2410.06355.pdf