Core Concepts
A structured semantic space grounded in linguistic knowledge can bridge the semantic gap across diverse action datasets, enabling more effective and transferable action understanding models.
Abstract
The article tackles the "isolated islands" problem in action understanding: existing datasets have incompatible semantic spaces because their class definitions are ambiguous and inconsistent. As a remedy, the authors propose a structured semantic space built on the VerbNet linguistic hierarchy, which provides unambiguous verb node definitions, rich semantic and geometric knowledge, and a hierarchical structure.
The key highlights are:
The authors design a structured semantic space using the VerbNet verb taxonomy, which covers over 5,800 verbs and provides clear semantic and geometric information for each node.
They build a unified "Pangea" database by aligning the classes of 28 diverse action datasets (image, video, skeleton, MoCap) to the structured semantic space.
They propose a Physical-to-Semantic (P2S) mapping model that effectively leverages the structured semantic space for action understanding, showing significant transfer learning capabilities.
In extensive experiments, the P2S model outperforms baselines such as CLIP, especially in few-shot and zero-shot settings.
The authors also show how the structured semantic space can enable semantic-to-physical generation of actions.
Overall, the work presents a principled approach to the long-standing challenge of semantic gaps in action understanding, paving the way for more generalizable and transferable action learning models.
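To make the alignment idea concrete, here is a minimal sketch of how classes from different datasets might be mapped onto shared verb nodes. Everything in it is an illustrative assumption: the node names, the dataset and class names, the `PARENT` and `ALIGNMENT` tables, and the choice to activate a node's ancestors are all hypothetical and are not taken from VerbNet or from the authors' implementation.

```python
from typing import Dict, List, Optional, Set, Tuple

# Hypothetical toy hierarchy: child node -> parent node (None marks a root).
# Real VerbNet node names and structure differ; these are placeholders.
PARENT: Dict[str, Optional[str]] = {
    "motion": None,
    "run": "motion",
    "jog": "run",
    "contact": None,
    "hit": "contact",
}

# Hypothetical alignment table: (dataset, class name) -> verb node.
ALIGNMENT: Dict[Tuple[str, str], str] = {
    ("dataset_a", "jogging"): "jog",
    ("dataset_b", "running"): "run",
    ("dataset_b", "punch"): "hit",
}

# Fixed node order so every dataset shares one label vector layout.
NODES: List[str] = sorted(PARENT)


def with_ancestors(node: str) -> Set[str]:
    """Return the node together with all of its ancestors in the hierarchy."""
    active: Set[str] = set()
    current: Optional[str] = node
    while current is not None:
        active.add(current)
        current = PARENT[current]
    return active


def unified_label(dataset: str, class_name: str) -> List[int]:
    """Map a dataset-specific class to a multi-hot vector over shared nodes.

    Assumption: a sample labeled with a node also counts as a positive
    for every ancestor of that node.
    """
    node = ALIGNMENT[(dataset, class_name)]
    active = with_ancestors(node)
    return [1 if n in active else 0 for n in NODES]


if __name__ == "__main__":
    for dataset, class_name in ALIGNMENT:
        print(dataset, class_name, unified_label(dataset, class_name))
```

In this toy setup, "jogging" from one dataset and "running" from another both activate the shared `run` and `motion` nodes, which is the kind of cross-dataset overlap a common semantic space is meant to expose.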
Stats
The Pangea database contains 19.5M images, 1.1M videos, and 840K 3D human samples across 28 diverse datasets.
The structured semantic space covers 513 verb nodes out of the 898 nodes in the VerbNet hierarchy, including 290 fine-grained leaf nodes.
Quotes
"We argue that we need a more principled semantic space to concentrate the community efforts and use all datasets together to pursue generalizable action learning."
"Our space has four-fold superiority: (1) Unambiguous verb nodes correlating all related verbs, (2) Rich knowledge, (3) Hierarchy to represent actions from abstract to specific granularity, (4) Extensive coverage."