The paper introduces the task of "Out of Sight, Not Out of Mind" (OSNOM) for egocentric video, which aims to maintain the 3D locations of active objects even when they are out of the camera's field of view. The authors propose a method called Lift, Match and Keep (LMK) that lifts 2D observations of objects to 3D world coordinates, matches them over time using appearance and location, and keeps track of the objects' locations even when they are out of sight.
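The "lift" step can be illustrated with a standard pinhole back-projection. This is a minimal sketch and not the authors' implementation: it assumes a per-pixel depth value (in metres along the optical axis), a 3x3 intrinsics matrix `K`, and a 4x4 camera-to-world pose, and the function name `lift_to_world` is hypothetical.

```python
import numpy as np

def lift_to_world(u, v, depth, K, cam_to_world):
    """Back-project pixel (u, v) with a known depth into world coordinates.

    Assumes a pinhole camera: K is the 3x3 intrinsics matrix and
    cam_to_world is a 4x4 homogeneous camera-to-world transform.
    """
    fx, fy = K[0, 0], K[1, 1]
    cx, cy = K[0, 2], K[1, 2]
    # Pixel -> 3D point in the camera frame
    x = (u - cx) / fx * depth
    y = (v - cy) / fy * depth
    p_cam = np.array([x, y, depth, 1.0])
    # Camera frame -> world frame
    return (cam_to_world @ p_cam)[:3]

# Illustrative values only: a camera translated 1 m along the world x-axis.
K = np.array([[500.0, 0.0, 320.0],
              [0.0, 500.0, 240.0],
              [0.0, 0.0, 1.0]])
pose = np.eye(4)
pose[:3, 3] = [1.0, 0.0, 0.0]

point = lift_to_world(320.0, 240.0, 2.0, K, pose)
```

Because the result is expressed in a fixed world frame, the stored location stays valid after the camera turns away, which is what makes the "keep" step possible.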
The key highlights and insights are:
LMK significantly outperforms baseline methods, including a state-of-the-art approach adapted for the OSNOM task. Its average improvement in tracking performance is 39% for time spans of up to 1 minute and 25% for spans from 1 to 12 minutes.
Maintaining 3D world locations is critical for correctly locating objects, both while they are moving and while they are occluded or out of view. Approaches that only track objects within the camera's field of view perform poorly.
LMK is able to correctly locate 64% of objects after 1 minute, 48% after 5 minutes, and 37% after 10 minutes, demonstrating its ability to maintain spatial awareness over both short and long time scales.
Qualitative results show LMK can accurately track objects when they are static on surfaces and when they are being moved around by the camera wearer.
Ablations demonstrate the importance of combining visual appearance and 3D location features for robust object matching, and the benefits of maintaining object permanence when objects reappear after going out of view.
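The idea of matching on both appearance and 3D location, while keeping unmatched objects at their last known position, can be sketched as follows. This is an illustration under stated assumptions rather than the paper's algorithm: the cost is a weighted sum of cosine distance and Euclidean distance, assignment is greedy, and the names `match_cost`, `update_tracks`, `alpha`, `scale`, and `max_cost` are all hypothetical.

```python
import numpy as np

def match_cost(track_feat, track_loc, obs_feat, obs_loc, alpha=0.5, scale=1.0):
    """Weighted sum of appearance (cosine) distance and 3D location distance."""
    cos = np.dot(track_feat, obs_feat) / (
        np.linalg.norm(track_feat) * np.linalg.norm(obs_feat)
    )
    appearance = 1.0 - cos
    location = np.linalg.norm(track_loc - obs_loc) / scale
    return alpha * appearance + (1.0 - alpha) * location

def update_tracks(tracks, observations, max_cost=0.5):
    """Greedily assign each observation to the cheapest free track.

    Tracks with no matching observation are left untouched, so an object
    that is out of sight keeps its last known 3D location.
    """
    free = set(range(len(tracks)))
    for feat, loc in observations:
        best, best_cost = None, max_cost
        for i in free:
            cost = match_cost(tracks[i]["feat"], tracks[i]["loc"], feat, loc)
            if cost < best_cost:
                best, best_cost = i, cost
        if best is None:
            # No track is close enough: start a new one
            tracks.append({"feat": feat, "loc": loc})
        else:
            tracks[best].update(feat=feat, loc=loc)
            free.discard(best)
    return tracks

# Illustrative data: two existing tracks, one new observation near the first.
tracks = [
    {"feat": np.array([1.0, 0.0]), "loc": np.array([0.0, 0.0, 0.0])},
    {"feat": np.array([0.0, 1.0]), "loc": np.array([2.0, 0.0, 0.0])},
]
observations = [(np.array([0.9, 0.1]), np.array([0.1, 0.0, 0.0]))]
tracks = update_tracks(tracks, observations)
```

Here the observation updates the first track, and the second track, which was not observed, retains its stored location. Setting `alpha` to 0 or 1 reduces the cost to location-only or appearance-only matching, the kind of comparison an ablation would make.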
Key insights distilled from a paper by Chiara Plizz... at arxiv.org, 04-09-2024
https://arxiv.org/pdf/2404.05072.pdf