Learning to Temporally Localize Object State Changes in Videos: An Open-World Approach
This work introduces a novel open-world formulation for temporally localizing the three stages of an object state change in video (initial, transitioning, end), addressing the limitations of existing closed-world approaches. The authors propose VIDOSC, a holistic learning framework that derives supervisory signals from text and vision-language models, and they develop object-agnostic state-prediction techniques that enable generalization to novel objects.
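As a rough illustration (not the authors' implementation), the localization task can be framed as classifying each frame into one of the three stages, initial, transitioning, or end, plus a background class, and then grouping consecutive predictions into temporal segments. The per-frame scores below are hand-crafted; in practice they would come from a learned state predictor.

```python
import numpy as np

# Hypothetical state classes: 0 = background, 1 = initial state,
# 2 = transitioning, 3 = end state.
STATES = ["background", "initial", "transitioning", "end"]

def localize_states(frame_probs):
    """Group consecutive per-frame argmax predictions into
    (state_name, start_frame, end_frame) segments, dropping background."""
    labels = frame_probs.argmax(axis=1)
    segments = []
    start = 0
    for t in range(1, len(labels) + 1):
        # Close a segment when the label changes or the clip ends.
        if t == len(labels) or labels[t] != labels[start]:
            if labels[start] != 0:
                segments.append((STATES[labels[start]], start, t - 1))
            start = t
    return segments

# Toy 10-frame clip: 2 background frames, 3 initial, 3 transitioning, 2 end.
probs = np.array(
    [[0.7, 0.2, 0.05, 0.05]] * 2
    + [[0.1, 0.8, 0.05, 0.05]] * 3
    + [[0.1, 0.1, 0.7, 0.1]] * 3
    + [[0.05, 0.05, 0.1, 0.8]] * 2
)
print(localize_states(probs))
# → [('initial', 2, 4), ('transitioning', 5, 7), ('end', 8, 9)]
```

This sketch only shows the output structure of the task; the paper's contribution lies in obtaining the per-frame supervision without closed-world object labels.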