Core Concepts
Enhancing human-robot interaction through natural conversations using pre-trained models.
Abstract
The paper introduces a framework for improving human-robot interaction (HRI) by leveraging pre-trained language and visual models. It discusses the challenges of current approaches, recent advances in generative AI and natural language processing (NLP), the proposed dual-modality framework, and results from real-world experiments. The framework aims to make interactions between humans and autonomous agents more intuitive and natural.
Abstract:
Extended method for natural interaction with autonomous agents.
Utilizes pre-trained large language models (LLMs) and multimodal visual language models (VLMs).
Achieved high accuracy in decoding vocal commands (see the sketch below).
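As a rough illustration of how a pre-trained LLM can turn a spoken instruction into a structured robot command, the Python sketch below uses the OpenAI chat API; the prompt, model name, command schema, and the decode_command helper are assumptions for illustration, not the paper's actual implementation.

```python
# Illustrative sketch only: maps a speech-recognition transcript to a structured
# robot command via a pre-trained LLM. Prompt, schema, and model are assumed.
import json
from openai import OpenAI

client = OpenAI()  # assumes OPENAI_API_KEY is set in the environment

SYSTEM_PROMPT = (
    "You translate spoken instructions for a quadruped robot into JSON commands. "
    'Reply only with {"action": "move"|"turn"|"stop", '
    '"direction": "forward"|"backward"|"left"|"right"|null, "distance_m": float|null}.'
)

def decode_command(transcript: str) -> dict:
    """Decode one speech-recognition transcript into a structured command."""
    response = client.chat.completions.create(
        model="gpt-3.5-turbo",
        messages=[
            {"role": "system", "content": SYSTEM_PROMPT},
            {"role": "user", "content": transcript},
        ],
        temperature=0.0,
    )
    return json.loads(response.choices[0].message.content)

if __name__ == "__main__":
    print(decode_command("Hey robot, walk forward about two meters"))
```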
Introduction:
Current approaches dominated by complex teleoperation controllers.
Need for more natural and intuitive interaction mechanisms.
Advancements in generative AI and NLP driving new avenues for HRI.
Method:
Framework overview with five main components.
Focus on vocal conversation decoding using SRNode.
Integration of the LLMNode, CLIPNode, REM node, ChatGUI, and SRNode components (a simplified wiring sketch follows below).
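The sketch below shows, in rospy, one assumed way the execution side of such a pipeline could be wired: a node subscribes to decoded commands (e.g., published by an LLM node) and republishes them as velocity commands. Topic names, message types, and the JSON command format are illustrative assumptions, not the framework's actual interfaces.

```python
#!/usr/bin/env python
# Assumed wiring of the command-execution side of the pipeline: decoded
# high-level commands in, base velocity commands out. Topic names and the
# command format are illustrative only.
import json
import rospy
from std_msgs.msg import String
from geometry_msgs.msg import Twist

class CommandExecutor:
    def __init__(self):
        # Velocity commands for the robot base (topic name assumed).
        self.cmd_pub = rospy.Publisher("/cmd_vel", Twist, queue_size=1)
        # Decoded commands as JSON strings (topic name assumed).
        rospy.Subscriber("/llm_node/decoded_command", String, self.on_command)

    def on_command(self, msg):
        cmd = json.loads(msg.data)
        twist = Twist()
        if cmd.get("action") == "move":
            sign = -1.0 if cmd.get("direction") == "backward" else 1.0
            twist.linear.x = sign * 0.3   # modest forward/backward speed
        elif cmd.get("action") == "turn":
            sign = 1.0 if cmd.get("direction") == "left" else -1.0
            twist.angular.z = sign * 0.5  # modest yaw rate
        # "stop" (or anything unrecognized) falls through to a zero Twist.
        self.cmd_pub.publish(twist)

if __name__ == "__main__":
    rospy.init_node("command_executor")
    CommandExecutor()
    rospy.spin()
```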
Experiments:
Real-world and simulated experiments conducted to validate the framework.
Utilized Unitree Go1 ROS & Gazebo packages for simulation.
Real-world experiments run on a Lenovo ThinkBook with an Intel Core i7 processor.
Conclusion and Future Work:
Framework leverages LLMs, VLMs, and SR models for enhanced HRI.
High accuracy in vocal command understanding achieved.
Future work includes refining the framework to resist environmental noise.
Stats
We performed a quantitative evaluation of our framework's natural vocal conversation understanding with participants from different racial backgrounds and with different English accents. The participants interacted with the robot using both spoken and textual instructional commands. Based on the logged interaction data, our framework achieved a vocal command decoding accuracy of 87.55%, a command execution success rate of 86.27%, and an average latency of 0.89 seconds from receiving a participant's vocal chat command to initiating the robot's physical action.
Compared with the results of the textual-based method [17], we observed a gap in command recognition accuracy: the vocal command understanding accuracy (VCUA) was about 12% lower than the textual-based CRA (99.13%), and the NSR was 11.69% lower.
Across all selected commands, the average response time (ART) was approximately 0.89 seconds.
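These figures reduce to simple counting and averaging over the logged interactions. The sketch below illustrates the arithmetic; the record fields (decoded_ok, executed_ok, response_time_s) are a hypothetical log schema, not the paper's actual one.

```python
# Illustrative computation of the reported metrics from logged interactions.
# The record fields are assumed for this sketch; the actual log schema may differ.
from dataclasses import dataclass
from typing import List

@dataclass
class Interaction:
    decoded_ok: bool        # was the vocal command decoded correctly?
    executed_ok: bool       # did the robot execute the intended action?
    response_time_s: float  # vocal command received -> physical action started

def summarize(logs: List[Interaction]) -> dict:
    n = len(logs)
    return {
        "vocal_decoding_accuracy_pct": 100.0 * sum(i.decoded_ok for i in logs) / n,
        "execution_success_pct": 100.0 * sum(i.executed_ok for i in logs) / n,
        "avg_response_time_s": sum(i.response_time_s for i in logs) / n,
    }

if __name__ == "__main__":
    demo = [Interaction(True, True, 0.8), Interaction(True, False, 1.1),
            Interaction(False, False, 0.7)]
    print(summarize(demo))
```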
Quotes
"Our evaluation from logged human interaction data achieved high vocal command understanding accuracy."
"Our framework can enhance the intuitiveness and naturalness of human-robot interaction in the real world."
"We aim to refine our framework to resist the impact of environmental noise."