The authors propose a self-supervised method that jointly learns 3D motion and depth from monocular videos, addressing a key limitation of existing methods: their difficulty in modeling dynamic scenes.
This work proposes Motion-Aware Loss (MAL), a novel plug-and-play module that leverages temporal coherence across frames and an enhanced distillation scheme to improve the accuracy of multi-frame self-supervised monocular depth estimation, particularly in dynamic scenes.
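To make the idea concrete, below is a minimal PyTorch sketch of what a loss combining a temporal-coherence term with a teacher-student distillation term could look like. Everything here is an assumption for illustration: the function name `mal_loss`, the weighting parameters, and the exact terms are hypothetical and are not taken from the paper's actual formulation.

```python
# Hypothetical sketch of a motion-aware loss: a temporal-coherence term plus
# a distillation term. Names and formulation are assumptions, not the paper's.
import torch
import torch.nn.functional as F


def mal_loss(depth_t, depth_prev_warped, teacher_depth, valid_mask,
             temporal_weight=0.1, distill_weight=1.0):
    """Combine two self-supervision signals (illustrative only):
    - temporal coherence: depth at frame t should agree with the depth of
      frame t-1 warped into frame t's view; valid_mask is expected to
      exclude pixels flagged as dynamic, where this assumption breaks;
    - distillation: the single-frame (student) depth is pulled toward the
      multi-frame (teacher) depth, which is typically more accurate.
    """
    # Temporal-coherence term: scale-normalized L1 between the two depth maps,
    # averaged over pixels the mask marks as reliable.
    temporal = (depth_t - depth_prev_warped).abs() / (
        depth_t + depth_prev_warped + 1e-7)
    temporal = (temporal * valid_mask).sum() / valid_mask.sum().clamp(min=1.0)

    # Distillation term: L1 to the teacher prediction, detached so gradients
    # flow only into the student.
    distill = F.l1_loss(depth_t, teacher_depth.detach())

    return temporal_weight * temporal + distill_weight * distill


# Minimal usage example with random tensors standing in for network outputs.
if __name__ == "__main__":
    b, h, w = 2, 192, 640
    depth_t = torch.rand(b, 1, h, w) + 0.1
    depth_prev_warped = torch.rand(b, 1, h, w) + 0.1
    teacher = torch.rand(b, 1, h, w) + 0.1
    mask = (torch.rand(b, 1, h, w) > 0.2).float()  # 1 = static pixel
    print(mal_loss(depth_t, depth_prev_warped, teacher, mask))
```

In a real pipeline, the teacher would presumably be the multi-frame network and the student the single-frame one, with the validity mask derived from photometric or motion cues; the paper's own scheme may differ in all of these choices.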