Core Concepts
A novel vision-language model-based approach for socially compliant robot navigation in human-centric environments.
Abstract
The paper proposes VLM-Social-Nav, a novel approach for social robot navigation that integrates vision-language models (VLMs) with optimization-based or scoring-based local motion planners and a state-of-the-art perception model.
The key highlights are:
VLM-Social-Nav leverages a VLM to analyze and reason about the current social interaction and to generate an immediate preferred robot action that guides the motion planner. This lets the robot detect social entities efficiently and make real-time decisions about socially compliant behavior.
The paper introduces a VLM-based scoring module that translates the current robot observation and textual instructions into a social cost term, which the low-level motion planner then uses to output appropriate robot actions (a hypothetical sketch of this coupling follows these highlights).
The approach is evaluated in four real-world indoor social navigation scenarios, where VLM-Social-Nav achieves at least a 36.37% improvement in average success rate and a 20.00% improvement (i.e., reduction) in average collision rate compared to other methods. A user study further shows that VLM-Social-Nav produces the most socially compliant navigation behavior.
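To make the coupling between the VLM and the planner concrete, below is a minimal Python sketch of how a VLM-derived social cost term might be folded into a sampling-based local planner's objective. Everything here is an assumption for illustration: query_vlm, social_cost, total_cost, the keyword matching, and the weight w_social are hypothetical names and formulations, not the paper's published code.

```python
# Hypothetical sketch: turning a VLM's preferred action into a social cost
# term for a DWA-style local planner. Names and weights are illustrative.
from dataclasses import dataclass

@dataclass
class Candidate:
    v: float  # linear velocity (m/s)
    w: float  # angular velocity (rad/s); negative = turn right

def query_vlm(image, instruction: str) -> str:
    """Placeholder for the VLM call: send the camera frame plus a textual
    instruction, return the model's preferred action. Stubbed here."""
    return "MOVE RIGHT, SLOW DOWN"

def social_cost(cand: Candidate, vlm_action: str, v_pref: float = 0.28) -> float:
    """Penalize candidate velocity commands that deviate from the VLM's
    suggested behavior; lower cost = more socially compliant."""
    action = vlm_action.upper()
    cost = 0.0
    if "RIGHT" in action:
        cost += max(0.0, cand.w)                 # penalize turning left
    if "LEFT" in action:
        cost += max(0.0, -cand.w)                # penalize turning right
    if "SLOW" in action:
        cost += max(0.0, cand.v - 0.5 * v_pref)  # penalize high speed
    if "STOP" in action:
        cost += cand.v                           # penalize forward motion
    return cost

def total_cost(cand: Candidate, base_cost: float, vlm_action: str,
               w_social: float = 1.0) -> float:
    """Combine the planner's usual objective (goal progress, obstacle
    clearance, smoothness) with the VLM-derived social term."""
    return base_cost + w_social * social_cost(cand, vlm_action)

# Example: score a few candidate commands against the VLM's suggestion.
action = query_vlm(image=None, instruction="Pass the person ahead politely.")
candidates = [Candidate(0.28, 0.3), Candidate(0.14, -0.3), Candidate(0.28, -0.3)]
best = min(candidates, key=lambda c: total_cost(c, base_cost=0.0, vlm_action=action))
print(action, "->", best)  # picks the slower, right-turning command
```

Under these assumptions the planner still optimizes its own objective; the VLM only reshapes the cost landscape, so a noisy or delayed VLM reply degrades gracefully rather than taking direct control of the robot.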
Stats
The robot is expected to navigate at a constant speed of 0.28 m/s.
The robot is expected to move to the right when passing by a person.
The robot is expected to not obstruct others' paths.
The robot is expected to pass on the left when overtaking a person.
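The expectations above read like the kind of textual instructions the scoring module consumes. Below is a hypothetical illustration of folding them into a VLM prompt; the paper's actual prompt wording is not reproduced here, and build_prompt and SOCIAL_RULES are invented names.

```python
# Hypothetical prompt assembly from the social expectations listed above.
SOCIAL_RULES = [
    "Navigate at a constant speed of 0.28 m/s.",
    "Move to the right when passing by a person.",
    "Do not obstruct others' paths.",
    "Pass on the left when overtaking a person.",
]

def build_prompt(goal: str) -> str:
    """Combine the navigation goal and social rules into one instruction."""
    rules = "\n".join(f"- {r}" for r in SOCIAL_RULES)
    return (
        f"You are guiding a mobile robot. Goal: {goal}\n"
        f"Follow these social norms:\n{rules}\n"
        "Given the attached camera image, reply with one preferred action, "
        "e.g., 'Move RIGHT with SLOWING DOWN' or 'STOP, slow down'."
    )

print(build_prompt("Reach the end of the corridor."))
```

The example replies in the prompt mirror the quoted VLM outputs below.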
Quotes
"Move RIGHT with SLOWING DOWN"
"STOP, slow down"