The paper proposes a novel framework called COLLAGE for generating collaborative agent-object-agent interactions. The key highlights are:
COLLAGE incorporates the knowledge and reasoning abilities of large language models (LLMs) to guide a latent diffusion model, addressing the lack of rich datasets in this domain.
The hierarchical vector-quantized variational autoencoder (VQ-VAE) architecture captures different motion-specific characteristics at multiple levels of abstraction, avoiding redundant concepts and enabling efficient multi-resolution representation.
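The idea of capturing motion at multiple levels of abstraction can be illustrated with a minimal residual vector-quantization sketch. This is not the paper's architecture: the codebook sizes, dimensions, and two-level residual scheme here are illustrative assumptions, standing in for the hierarchical VQ-VAE described above.

```python
import numpy as np

def quantize(latents, codebook):
    """Map each latent vector to its nearest codebook entry (vector quantization)."""
    # Pairwise squared distances: (num_latents, codebook_size)
    d = ((latents[:, None, :] - codebook[None, :, :]) ** 2).sum(axis=-1)
    idx = d.argmin(axis=1)
    return codebook[idx], idx

rng = np.random.default_rng(0)

# Two levels of abstraction (sizes are hypothetical, not from the paper):
# a small coarse codebook for global motion structure, a larger fine one for detail.
coarse_book = rng.normal(size=(32, 8))
fine_book = rng.normal(size=(256, 8))

z = rng.normal(size=(16, 8))               # toy encoder output for 16 time steps
z_coarse, _ = quantize(z, coarse_book)     # top level: coarse motion concepts
residual = z - z_coarse                    # detail the coarse level missed
z_fine, _ = quantize(residual, fine_book)  # bottom level: fine-grained detail

recon = z_coarse + z_fine                  # multi-resolution latent reconstruction
```

Quantizing the residual rather than re-quantizing the full latent is what keeps the two levels from encoding redundant concepts: each level only represents what the level above it could not.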
The diffusion model operates in the latent space and incorporates LLM-generated motion planning cues to guide the denoising process, resulting in prompt-specific motion generation with greater control and diversity.
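A guided denoising loop of this kind can be sketched as follows. Everything here is a toy stand-in under stated assumptions: `denoiser` is a placeholder for the learned noise predictor, the update rule is not the real DDPM posterior step, and conditioning on an LLM cue embedding via classifier-free-guidance-style mixing is an illustrative choice, not a claim about COLLAGE's exact sampler.

```python
import numpy as np

rng = np.random.default_rng(0)

def denoiser(z_t, t, cue):
    """Placeholder noise predictor; a real model would be a network
    conditioned on the timestep and an embedding of the LLM planning cue."""
    return 0.1 * z_t + 0.01 * t + 0.05 * cue

def guided_denoise(z_T, cue_embedding, steps=50, guidance=2.0):
    """Classifier-free-guidance-style sampling loop (illustrative dynamics only)."""
    z = z_T
    zero_cue = np.zeros_like(cue_embedding)  # "unconditional" branch drops the cue
    for t in reversed(range(steps)):
        eps_cond = denoiser(z, t, cue_embedding)
        eps_uncond = denoiser(z, t, zero_cue)
        # Push the sample toward the cue-conditioned prediction.
        eps = eps_uncond + guidance * (eps_cond - eps_uncond)
        z = z - eps / steps  # toy update standing in for the reverse diffusion step
    return z

cue = rng.normal(size=(8,))     # hypothetical embedding of an LLM motion plan
z_T = rng.normal(size=(16, 8))  # noisy latent motion sequence (16 steps x 8 dims)
z_0 = guided_denoise(z_T, cue)
```

Raising the `guidance` weight trades diversity for tighter adherence to the planning cue, which is the control knob the summary alludes to.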
Experimental results on the CORE-4D and InterHuman datasets demonstrate the effectiveness of COLLAGE in generating realistic and diverse collaborative human-object-human interactions, outperforming state-of-the-art methods.
The proposed approach opens up new possibilities for modeling complex interactions in various domains, such as robotics, graphics, and computer vision.
Key Insights Distilled From
by Divyanshu Da... at arxiv.org, 10-01-2024
https://arxiv.org/pdf/2409.20502.pdf