Core Concepts
Virtually augmenting the NYU Depth V2 dataset with randomly generated 3D objects improves the performance and generalization of deep neural networks for monocular depth estimation.
Abstract
The paper presents ANYU, a new virtually augmented version of the NYU Depth V2 dataset, designed to improve monocular depth estimation. Unlike previous approaches that use full 3D scenes of a virtual world, ANYU was created by incorporating RGB-D representations of virtual reality objects into the original NYU Depth V2 images.
The authors deliberately did not match each virtual object with an appropriate texture and location within the real-world image. Instead, they randomized the generation of virtual objects and the choice of their placement to maximize the diversity of the training data. Extensive experiments on the original and virtually enriched NYU Depth V2 datasets, as well as the iBims-1 benchmark, show that ANYU improves monocular depth estimation performance and generalization for deep neural networks with different architectures, especially the state-of-the-art VPD model.
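As a concrete illustration, this augmentation can be viewed as a depth-aware composite of a pre-rendered object into a real RGB-D pair. The following minimal Python sketch (all function and variable names are hypothetical, not from the authors' code) assumes the virtual object has already been rendered to an RGB crop, a depth map, and a visibility mask:

```python
import numpy as np

def composite_virtual_object(rgb, depth, obj_rgb, obj_depth, obj_mask, rng):
    """Depth-aware paste of a pre-rendered virtual object into a real RGB-D pair."""
    H, W = depth.shape
    h, w = obj_depth.shape
    # Randomized placement, mirroring ANYU's deliberately unmatched
    # texture/location strategy.
    y = rng.integers(0, H - h + 1)
    x = rng.integers(0, W - w + 1)
    out_rgb, out_depth = rgb.copy(), depth.copy()
    region_depth = out_depth[y:y + h, x:x + w]
    # The object is kept only where it is closer than the real surface,
    # so existing scene geometry occludes it correctly.
    visible = obj_mask & (obj_depth < region_depth)
    out_rgb[y:y + h, x:x + w][visible] = obj_rgb[visible]
    region_depth[visible] = obj_depth[visible]
    return out_rgb, out_depth

# Toy data standing in for a real NYU frame and one rendered object.
rng = np.random.default_rng(0)
rgb = rng.integers(0, 256, (480, 640, 3), dtype=np.uint8)
depth = rng.uniform(0.5, 10.0, (480, 640))
obj_rgb = rng.integers(0, 256, (64, 64, 3), dtype=np.uint8)
obj_depth = rng.uniform(0.5, 3.0, (64, 64))
obj_mask = np.ones((64, 64), dtype=bool)
aug_rgb, aug_depth = composite_virtual_object(rgb, depth, obj_rgb, obj_depth, obj_mask, rng)
```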
The key findings are:
- Supplementing the NYU Depth V2 dataset with 5-40% virtually augmented RGB-D image pairs yields the best depth estimation results; increasing the augmentation share further degrades performance slightly, yet it still exceeds training on the original dataset alone (see the mixing sketch after this list).
- When the original NYU Depth V2 training set is shrunk, expanding the remaining images with virtually enriched ones significantly mitigates the performance drop, demonstrating the importance of data diversity.
- The VPD and PixelFormer models trained on ANYU achieve new state-of-the-art results on the NYU Depth V2 test set and improved cross-dataset generalization on the iBims-1 benchmark.
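One way to read the 5-40% finding is as a dataset-mixing ratio. The sketch below (hypothetical names, not the authors' code) appends a chosen fraction of augmented pairs to the original training list:

```python
import random

def build_training_list(original_pairs, augmented_pairs, aug_fraction, seed=0):
    """Return the original RGB-D pairs plus aug_fraction * len(original) augmented ones.

    aug_fraction=0.1 corresponds to the 10% configuration; values in
    roughly the 0.05-0.40 range gave the best results in the paper.
    """
    rng = random.Random(seed)
    n_aug = int(aug_fraction * len(original_pairs))
    extras = rng.sample(augmented_pairs, min(n_aug, len(augmented_pairs)))
    return list(original_pairs) + extras

# e.g. train_list = build_training_list(nyu_pairs, anyu_pairs, aug_fraction=0.10)
```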
The authors make the ANYU dataset publicly available in two configurations with 10% and 100% additional synthetically enriched RGB-D pairs for efficient training and empirical exploration of virtual augmentation.
Stats
Reducing the NYU Depth V2 training set from 100% to 1% increases the RMSE by a factor of 2, while expanding the remaining 1% with virtually enriched images improves the RMSE by 25%.
Training the VPD model on the full NYU Depth V2 dataset with 10% virtually augmented images achieves the lowest RMSE of 0.248, compared to 0.254 without augmentation.
The PixelFormer model trained on ANYU improves the RMSE from 0.322 to 0.320 on the NYU Depth V2 test set, and from 1.595 to 1.539 on the iBims-1 benchmark.
The VPD model trained on ANYU achieves a new state-of-the-art RMSE of 1.365 on the iBims-1 dataset, compared to 1.493 without augmentation.
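All of the figures above are root-mean-square error (RMSE) in meters between predicted and ground-truth depth. For reference, a generic computation looks like this (a sketch, not tied to either benchmark's exact evaluation protocol):

```python
import numpy as np

def rmse(pred_depth, gt_depth, valid_mask=None):
    """Root-mean-square error in meters over valid ground-truth pixels."""
    if valid_mask is None:
        valid_mask = gt_depth > 0  # NYU-style: zero depth marks missing data
    diff = pred_depth[valid_mask] - gt_depth[valid_mask]
    return float(np.sqrt(np.mean(diff ** 2)))
```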
Quotes
"In contrast to the well-known approach where full 3D scenes of a virtual world are utilized to generate artificial datasets, ANYU was created by incorporating RGB-D representations of virtual reality objects into the original NYU depth v2 images."
"We specifically did not match each generated virtual object with an appropriate texture and a suitable location within the real-world image. Instead, an assignment of texture, location, lighting, and other rendering parameters was randomized to maximize a diversity of the training data, and to show that it is randomness that can improve the generalizing ability of a dataset."