Core Concepts
This work introduces a novel open-world formulation for the problem of temporally localizing the three stages (initial, transitioning, end) of object state changes in videos, addressing the limitations of existing closed-world approaches. The authors propose VIDOSC, a holistic learning framework that leverages text and vision-language models for supervisory signals and incorporates object-agnostic state prediction techniques to generalize to novel objects.
Summary
The paper introduces a novel open-world formulation for the problem of temporally localizing object state changes (OSCs) in videos. Existing approaches are limited to a closed vocabulary of known OSCs, whereas the authors aim to enable generalization to novel objects and state transitions.
Key highlights:
- Formally characterize OSCs in videos as having three states: initial, transitioning, and end.
- Propose an open-world setting where the model is evaluated on both known and novel OSCs.
- Develop VIDOSC, a holistic learning framework that:
  - Leverages text and vision-language models to generate pseudo-labels, eliminating the need for manual annotation (a rough sketch of this idea follows this list).
  - Introduces techniques for object-agnostic state prediction, including a shared state vocabulary, temporal modeling, and object-centric features, to enhance generalization.
- Introduce HowToChange, a large-scale dataset that reflects the long-tail distribution of real-world OSCs, with an order of magnitude increase in the label space compared to previous benchmarks.
- Experimental results demonstrate that VIDOSC outperforms state-of-the-art approaches in both closed-world and open-world settings.
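
As a rough illustration of the vision-language pseudo-labeling idea, the sketch below scores each frame against text prompts describing the three OSC states with a CLIP-style model. The prompt templates, the background prompt, and the function name are illustrative assumptions, not the authors' actual pipeline.

```python
# Minimal sketch: pseudo-labeling video frames with a CLIP-style
# vision-language model. Prompt templates and the background prompt
# are illustrative assumptions, not the VIDOSC pipeline.
import torch
from PIL import Image
from transformers import CLIPModel, CLIPProcessor

model = CLIPModel.from_pretrained("openai/clip-vit-base-patch32")
processor = CLIPProcessor.from_pretrained("openai/clip-vit-base-patch32")

def pseudo_label_frames(frames: list[Image.Image], object_name: str, transition: str):
    """Assign each frame to {initial, transitioning, end, background}."""
    prompts = [
        f"a photo of {object_name} before {transition}",  # initial state
        f"a photo of {object_name} while {transition}",   # transitioning state
        f"a photo of {object_name} after {transition}",   # end state
        "a photo of something unrelated",                 # background
    ]
    inputs = processor(text=prompts, images=frames, return_tensors="pt", padding=True)
    with torch.no_grad():
        logits = model(**inputs).logits_per_image  # (T, 4) frame-text similarities
    return logits.argmax(dim=-1)  # per-frame pseudo-label in {0, 1, 2, 3}
```

In the paper, pseudo-labels come from a combination of text and vision-language signals; this sketch only shows the core frame-text scoring step.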
Stats
Videos in the HowToChange dataset average 41.2 seconds in duration.
The HowToChange dataset contains 36,075 training videos and 5,424 manually annotated evaluation videos.
The HowToChange dataset covers 134 objects and 20 state transitions, yielding 409 unique OSCs (fewer than the full 134 × 20 = 2,680 combinations, since not every transition applies to every object).
Citations
"Importantly, there is an intrinsic connection that threads these OSC processes together. As humans, even if we have never encountered an unusual substance (e.g., jaggery) before, we can still understand that it is experiencing a "melting" process based on its visual transformation cues."
"Consistent with prior work [17], we formulate the task as a frame-wise classification problem. Formally, a video sequence is represented by a temporal sequence of T feature vectors X = {x1, x2, · · · , xT }, where T denotes the video duration. The goal is to predict the OSC state label for each timepoint, denoted by Y = {y1, y2, · · · , yT }, yt ∈{1, · · · , K + 1}, where K is the total number of OSC states, and there is one additional category representing the background class not associated with any OSC state."