The paper proposes a two-step framework called ECHO to generate natural and meaningful human-robot interactions.
First, the authors build a shared latent space that represents the semantics of human and robot poses, enabling effective motion retargeting between them. This shared space is learned without the need for annotated human-robot skeleton pairs.
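The shared-space idea can be pictured with a minimal sketch: two pose encoders map human and robot configurations into one latent space, and per-domain decoders are trained from reconstruction alone, so no annotated human-robot pose pairs are required. Everything below (module names such as SharedPoseSpace, the pose dimensions, and the loss) is an illustrative assumption written in PyTorch, not the paper's actual implementation.

```python
# Minimal sketch of a shared human-robot pose latent space (illustrative only).
# Module names, dimensions, and losses are assumptions, not the paper's code.
import torch
import torch.nn as nn

LATENT_DIM = 64           # assumed size of the shared semantic space
HUMAN_POSE_DIM = 21 * 3   # e.g. 21 human joints in 3D (assumption)
ROBOT_POSE_DIM = 7        # e.g. a 7-DoF robot arm configuration (assumption)


def mlp(in_dim, out_dim, hidden=128):
    return nn.Sequential(nn.Linear(in_dim, hidden), nn.ReLU(), nn.Linear(hidden, out_dim))


class SharedPoseSpace(nn.Module):
    """Encodes human and robot poses into one latent space and decodes back."""

    def __init__(self):
        super().__init__()
        self.human_enc = mlp(HUMAN_POSE_DIM, LATENT_DIM)
        self.robot_enc = mlp(ROBOT_POSE_DIM, LATENT_DIM)
        self.human_dec = mlp(LATENT_DIM, HUMAN_POSE_DIM)
        self.robot_dec = mlp(LATENT_DIM, ROBOT_POSE_DIM)

    def retarget_human_to_robot(self, human_pose):
        # Motion retargeting: human pose -> shared latent -> robot pose.
        z = self.human_enc(human_pose)
        return self.robot_dec(z)

    def reconstruction_loss(self, human_pose, robot_pose):
        # Unpaired training signal: each domain is reconstructed through the
        # shared space, so no annotated human-robot pose pairs are needed.
        h_rec = self.human_dec(self.human_enc(human_pose))
        r_rec = self.robot_dec(self.robot_enc(robot_pose))
        return (nn.functional.mse_loss(h_rec, human_pose)
                + nn.functional.mse_loss(r_rec, robot_pose))


model = SharedPoseSpace()
human_batch = torch.randn(8, HUMAN_POSE_DIM)
robot_batch = torch.randn(8, ROBOT_POSE_DIM)
print(model.retarget_human_to_robot(human_batch).shape)   # torch.Size([8, 7])
print(model.reconstruction_loss(human_batch, robot_batch))
```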
Second, the ECHO architecture operates in this shared space to forecast human motions in social scenarios. It first learns to predict individual human motions using a self-attention transformer. Then, it iteratively refines these motions based on the surrounding agents using a cross-attention mechanism. This refinement process ensures the generated motions are socially compliant and synchronized.
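The two-stage forecasting architecture can likewise be sketched: a temporal self-attention encoder predicts each agent's motion independently, then a cross-attention step is applied iteratively so that each agent's prediction is refined against the surrounding agents. The layer choices, the residual update, and all hyperparameters below are assumptions made for illustration, not ECHO's released code.

```python
# Illustrative sketch of the two-stage forecasting idea: per-agent self-attention
# over the motion history, then iterative cross-attention refinement against the
# other agents. Hyperparameters and layer choices are assumptions, not ECHO's code.
import torch
import torch.nn as nn

LATENT_DIM = 64  # pose embeddings in the shared latent space (assumption)


class SocialForecaster(nn.Module):
    def __init__(self, num_refine_steps=3):
        super().__init__()
        # Stage 1: individual motion model (temporal self-attention).
        self.self_attn = nn.TransformerEncoder(
            nn.TransformerEncoderLayer(d_model=LATENT_DIM, nhead=4, batch_first=True),
            num_layers=2,
        )
        # Stage 2: social refinement (cross-attention to surrounding agents).
        self.cross_attn = nn.MultiheadAttention(LATENT_DIM, num_heads=4, batch_first=True)
        self.num_refine_steps = num_refine_steps

    def forward(self, motions):
        # motions: (num_agents, seq_len, LATENT_DIM), one sequence per agent.
        # Stage 1: forecast each agent independently.
        preds = self.self_attn(motions)
        # Stage 2: iteratively refine each agent's motion by attending to the others.
        for _ in range(self.num_refine_steps):
            refined = []
            for i in range(preds.shape[0]):
                query = preds[i:i + 1]                                   # this agent
                context = torch.cat([preds[:i], preds[i + 1:]], dim=0)   # the others
                context = context.reshape(1, -1, LATENT_DIM)             # flatten agents x time
                delta, _ = self.cross_attn(query, context, context)
                refined.append(query + delta)                            # residual update
            preds = torch.cat(refined, dim=0)
        return preds


forecaster = SocialForecaster()
history = torch.randn(3, 10, LATENT_DIM)   # 3 agents, 10 timesteps
future = forecaster(history)
print(future.shape)                         # torch.Size([3, 10, 64])
```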
The authors evaluate ECHO on the large-scale InterGen dataset for social motion forecasting and the CHICO dataset for human-robot collaboration tasks. ECHO outperforms state-of-the-art methods by a large margin in both settings, demonstrating its effectiveness in generating natural and accurate human-robot interactions.
The key innovations include:
- A shared human-robot latent pose space, learned without annotated human-robot skeleton pairs, that enables motion retargeting between the two embodiments.
- A forecasting architecture that first predicts each agent's motion with temporal self-attention and then iteratively refines it with cross-attention over the surrounding agents, yielding socially compliant, synchronized motions.
Key insights distilled from the paper by Esteve Valls... at arxiv.org, 04-09-2024: https://arxiv.org/pdf/2402.04768.pdf