Core Concepts
How accurately Video Transformers predict planar pushing trajectories depends on the physics attributes of the pushed object and on the characteristics of the scene background. Understanding these dependencies is key to improving model generalization.
Abstract
The study investigates how physics attributes (color, friction coefficient, shape) and scene background characteristics influence Video Transformers (VTs) trained to predict planar pushing trajectories. It introduces CloudGripper-Push-1K, a large real-world dataset comprising 1278 hours and 460,000 videos of planar pushing interactions. The research aims to determine which factors affect model generalization and how much data is needed for training. Three types of 2D-spatial encoders are explored within a generic Video Occlusion Transformer framework. Results show that VTs' performance declines with increasing background complexity but improves with dynamic backgrounds. Of the object attributes studied, color has the most significant impact on performance, ahead of friction coefficient and shape. Fine-tuning models with limited data can lead to varying performance outcomes.
Stats
CloudGripper-Push-1K dataset: 1278 hours, 460,000 videos
Training set size: 20,000 videos; Testing set size: 2,000 videos
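To put these figures in proportion, the short sketch below derives two quantities the summary does not state directly: the average clip length implied by the dataset totals, and the share of the dataset covered by the listed training and testing sets. It assumes the 1278 hours and 460,000 videos are totals for the same collection and that the reported counts may be rounded, so the results are approximate. The function and variable names are illustrative only and do not come from the paper.

```python
# Figures taken from the Stats section above.
TOTAL_HOURS = 1278
TOTAL_VIDEOS = 460_000
TRAIN_VIDEOS = 20_000
TEST_VIDEOS = 2_000


def mean_clip_seconds(total_hours: float, total_videos: int) -> float:
    """Average video length in seconds implied by the dataset totals."""
    return total_hours * 3600 / total_videos


def fraction_of_dataset(subset_size: int, total_size: int) -> float:
    """Share of the full dataset covered by a subset of videos."""
    return subset_size / total_size


mean_len = mean_clip_seconds(TOTAL_HOURS, TOTAL_VIDEOS)
train_frac = fraction_of_dataset(TRAIN_VIDEOS, TOTAL_VIDEOS)
test_frac = fraction_of_dataset(TEST_VIDEOS, TOTAL_VIDEOS)

print(f"Mean clip length: {mean_len:.1f} s")
print(f"Training share of dataset: {train_frac:.1%}")
print(f"Testing share of dataset: {test_frac:.1%}")
```

Under these assumptions, each video is roughly ten seconds long, and the listed training set covers only a few percent of the available videos, which is consistent with the study's interest in how much data training actually requires.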
Quotes
"VTs' performance appears to decline with increasing background scene complexity."
"Color has the most significant impact on VTs' performance."
"Fine-tuning models with limited data can lead to varying performance outcomes."