Core Concepts
The authors introduce the ASPIRe dataset and the Hierarchical Interlacement Graph (HIG) model to address the Visual Interactivity Understanding problem, offering a new benchmark and a unified hierarchical structure for capturing complex video interactivities.
Abstract
The paper discusses the challenges of visual interactivity understanding and introduces the ASPIRe dataset, which covers five interactivity types. The proposed HIG model provides deep insights into scene changes across different tasks, and extensive experiments demonstrate its superior performance. The methodology, training loss, ablation study, comparison with state-of-the-art methods, and limitations are each discussed in detail.
Stats
The ASPIRe dataset contains 1.5K videos spanning 833 object categories and 4.5K interactivities.
The HIG model outperforms the Transformer baseline on single-actor attributes by 2.67% at R@20.
Compared to GPSNet, HIG achieves improvements of 3.55%, 5.82%, and 6.73% at R@100 for position, interaction, and relation, respectively.
Halving the number of input frames slightly decreases HIG's recall but increases inference speed by 2.2 FPS.
On the PSG dataset, the HIG model achieves results comparable to state-of-the-art methods.
Quotes
"The proposed HIG framework integrates the evolution of interactivities over time."
"HIG operates with a unique unified layer at every level to jointly process interactivities."