Core Concepts
LAVIMO introduces a novel framework for three-modality learning, integrating human-centric videos to enhance alignment between text and motion modalities.
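The summary above does not spell out LAVIMO's training objective, so the following is only a minimal sketch of how a shared embedding space across text, video, and motion could be aligned with a generic CLIP-style pairwise contrastive loss; the function names, batch shapes, and temperature value are illustrative assumptions, not the authors' implementation.

```python
import numpy as np

def l2_normalize(x, axis=-1):
    """Project embeddings onto the unit sphere so dot products are cosine similarities."""
    return x / np.linalg.norm(x, axis=axis, keepdims=True)

def pairwise_contrastive_loss(a, b, temperature=0.07):
    """Symmetric InfoNCE between two batches of embeddings matched by index.

    NOTE: a generic CLIP-style objective used here for illustration;
    the actual LAVIMO loss may differ.
    """
    a, b = l2_normalize(a), l2_normalize(b)
    logits = a @ b.T / temperature          # (N, N) similarity matrix
    labels = np.arange(len(a))              # i-th a matches i-th b

    def xent(l):
        # numerically stable log-softmax cross-entropy on the diagonal
        l = l - l.max(axis=1, keepdims=True)
        logp = l - np.log(np.exp(l).sum(axis=1, keepdims=True))
        return -logp[labels, labels].mean()

    # average the two retrieval directions (a -> b and b -> a)
    return 0.5 * (xent(logits) + xent(logits.T))

def trimodal_alignment_loss(text_emb, video_emb, motion_emb):
    """Sum of the three pairwise alignment terms across the modalities."""
    return (pairwise_contrastive_loss(text_emb, video_emb)
            + pairwise_contrastive_loss(text_emb, motion_emb)
            + pairwise_contrastive_loss(video_emb, motion_emb))

# toy batch: 8 samples, 16-dim embeddings per modality
rng = np.random.default_rng(0)
t, v, m = (rng.normal(size=(8, 16)) for _ in range(3))
loss = trimodal_alignment_loss(t, v, m)
```

With random embeddings the loss is large; training would push matched text/video/motion triplets together and mismatched ones apart, which is the intuition behind using video as a bridge between text and motion.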
Stats
LAVIMO achieves state-of-the-art performance across several motion-related cross-modal retrieval tasks.
HumanML3D dataset features 14,616 motions and 44,970 textual descriptions.
KIT-ML dataset contains 3,911 sequences and 6,278 text annotations.
Quotes
"Our key contributions are summarized as follows: We introduce LAnguage-VIdeo-MOtion Alignment (LAVIMO), a framework designed to cultivate a cohesive embedding space across the three aforementioned modalities."
"Our model demonstrates precision in retrieving the exact ground-truth motion in the rank1 position."