Core Concepts
A hybrid framework that couples an automated object tracker with minimal human input, using self-supervised learning to detect tracker failures and decide when a human should be asked for a correction.
Abstract
The proposed framework, called SSLTrack, aims to consistently produce high-quality object tracks with minimal human involvement. It combines an automated object tracker with a self-supervised learning module that learns a tailored representation for the target objects.
The key idea is to use this self-supervised representation to actively monitor the tracked object and detect when the tracker is failing. When a failure is detected, the framework solicits human input to re-localize the object so tracking can continue. Because the monitoring module is separate from the tracker itself, the framework is agnostic to the underlying automated tracker and benefits directly from ongoing progress on such trackers.
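To make the failure check concrete, here is a minimal sketch in Python. The encoder `embed`, the crop inputs, and the threshold `tau` are illustrative assumptions rather than the paper's actual interface or values; the point is simply that the tracked object's embedding should stay close to the reference template's embedding under the learned representation.

```python
import numpy as np

def cosine_similarity(a, b):
    """Cosine similarity between two feature vectors."""
    return float(np.dot(a, b) / (np.linalg.norm(a) * np.linalg.norm(b)))

def is_tracking_failure(embed, template_crop, tracked_crop, tau=0.6):
    """Flag a failure when the tracked crop's self-supervised embedding
    drifts too far from the reference template's embedding.

    `embed` is the learned self-supervised encoder (image crop -> vector);
    `tau` is an illustrative threshold, not a value from the paper.
    """
    template_feat = embed(template_crop)
    tracked_feat = embed(tracked_crop)
    return cosine_similarity(template_feat, tracked_feat) < tau
```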
The experiments demonstrate the framework's versatility: it outperforms existing hybrid tracking methods, including the state of the art, across three challenging datasets. The advantage is particularly pronounced for small, fast-moving, or occluded objects, which are the hardest cases for automated trackers.
The framework first learns the self-supervised object representation offline from unlabeled videos. During online tracking, it compares the representation of the tracked object against that of the reference template to detect tracking failures. A neighborhood-search strategy then selects which frames actually need human annotation, minimizing the overall human effort required.
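The online loop might look like the following sketch, reusing `is_tracking_failure` from above. The `tracker` interface (`track`/`reset`), the `crop` helper, and `request_human_box` are hypothetical stand-ins for the automated tracker and the human-in-the-loop interface, and the sketch simplifies the neighborhood search by re-annotating the frame where the failure is detected.

```python
def hybrid_track(tracker, embed, frames, init_box, tau=0.6):
    """Run an off-the-shelf tracker, monitor it with the self-supervised
    failure check, and fall back to a human annotation on failure.
    All interfaces here are illustrative, not the paper's API.
    """
    template = crop(frames[0], init_box)   # reference appearance
    tracker.reset(frames[0], init_box)     # initialize the tracker
    boxes = [init_box]
    for frame in frames[1:]:
        box = tracker.track(frame)
        if is_tracking_failure(embed, template, crop(frame, box), tau):
            # The paper's neighborhood search chooses which nearby frame
            # to send to a human; this sketch simply asks for the current
            # frame, then re-initializes the tracker from the answer.
            box = request_human_box(frame)
            tracker.reset(frame, box)
        boxes.append(box)
    return boxes
```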
The experiments show that the proposed framework achieves higher tracking accuracy with less human involvement than prior work. For example, on the ImageNet VID dataset, the framework reaches 0.84 recall using only 4.5 boxes per object, whereas the state-of-the-art hybrid approach needs 4.9 boxes per object to reach 0.81 recall: roughly 8% fewer boxes for a 3-point recall gain. At scale, this translates into significant savings in annotation time and cost.
Stats
At a rate of 5.25 seconds per bounding box, annotating the standard MOT16 tracking dataset (14 videos) would take at least 427 hours.
Annotating a dataset of 1 million objects would cost $107,190 using the state-of-the-art hybrid approach, compared to $98,445 using the proposed framework.
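These figures are consistent with the numbers above. As a back-of-the-envelope check (the ~$15/hour annotation rate is an assumption the stated costs imply, not a figure given here):

```python
# 427 hours at 5.25 s/box implies MOT16 contains roughly
# 427 * 3600 / 5.25 ≈ 292,800 bounding boxes across its 14 videos.
mot16_boxes = 427 * 3600 / 5.25

# Cost for 1M objects, assuming 4.9 vs 4.5 boxes per object (as on
# ImageNet VID above), 5.25 s per box, and an implied ~$15/hour rate.
rate_per_second = 15 / 3600
cost_sota = 1_000_000 * 4.9 * 5.25 * rate_per_second  # ≈ $107,188
cost_ours = 1_000_000 * 4.5 * 5.25 * rate_per_second  # ≈ $98,438
savings = cost_sota - cost_ours                        # ≈ $8,750 (~8%)
```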
Quotes
"The key idea is to tailor a module for each dataset to intelligently decide when an object tracker is failing and so humans should be brought in to re-localize an object for continued tracking."
"Since labeled data is not needed, our approach can be applied to novel object categories."