The paper introduces the novel problem of localizing a query image within a database of 3D scene graphs that integrate multiple modalities. The proposed method, SceneGraphLoc, addresses this challenge by learning a fixed-size embedding for each node in the scene graph, enabling effective matching against the objects visible in the query image, as sketched below.
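To make the matching idea concrete, here is a minimal sketch (not the authors' code) of how coarse localization could work once node and patch embeddings live in a shared space: each candidate scene is scored by how well its node embeddings explain the query image's patch embeddings. The function name `score_scene` and the embedding dimension are illustrative assumptions.

```python
import torch
import torch.nn.functional as F

def score_scene(patch_emb: torch.Tensor, node_emb: torch.Tensor) -> torch.Tensor:
    """patch_emb: (P, D) embeddings of object patches in the query image.
    node_emb:  (N, D) fixed-size embeddings of one scene graph's nodes.
    Returns a scalar score; higher means the scene better explains the image."""
    # Cosine similarity between every patch and every node.
    sim = F.normalize(patch_emb, dim=-1) @ F.normalize(node_emb, dim=-1).T  # (P, N)
    # For each patch, keep its best-matching node, then average over patches.
    return sim.max(dim=-1).values.mean()

# Retrieval: pick the scene graph whose nodes best match the query image.
torch.manual_seed(0)
query_patches = torch.randn(12, 256)                       # 12 patches, D = 256
scenes = {f"scene_{i}": torch.randn(30, 256) for i in range(5)}  # toy map database
best = max(scenes, key=lambda s: score_scene(query_patches, scenes[s]).item())
print(best)
```

Note that only the embeddings are stored per scene, which is consistent with the paper's claim of a far smaller memory footprint than image-retrieval databases.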
The key highlights and insights are:
SceneGraphLoc significantly outperforms other cross-modal methods, even without incorporating images into the map embeddings. When images are leveraged, SceneGraphLoc approaches the performance of state-of-the-art techniques that depend on large image databases, while requiring three orders of magnitude less storage and operating orders of magnitude faster.
The method uses contrastive learning to build a joint embedding space for scene graph nodes and image patches, making them directly matchable. This enables coarse localization of a query image within the database of 3D scene graphs (see the sketch after this list).
Experiments on the 3RScan and ScanNet datasets demonstrate the effectiveness of SceneGraphLoc, with the best performance achieved when integrating all proposed modalities (point clouds, images, attributes, structure, and relationships).
Qualitative analysis shows that localization performance correlates with the diversity of objects observed in the query image: more diverse and distinctive objects lead to better retrieval results.
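The following is a minimal sketch of a symmetric InfoNCE-style contrastive objective of the kind the joint embedding space could be trained with, assuming each image patch in a batch is row-aligned with its corresponding scene-graph node (a hypothetical setup; the paper's exact loss, pairing, and temperature may differ).

```python
import torch
import torch.nn.functional as F

def contrastive_loss(patch_emb: torch.Tensor,
                     node_emb: torch.Tensor,
                     temperature: float = 0.07) -> torch.Tensor:
    """patch_emb, node_emb: (B, D) tensors where row i of each is a positive pair."""
    p = F.normalize(patch_emb, dim=-1)
    n = F.normalize(node_emb, dim=-1)
    logits = p @ n.T / temperature                  # (B, B) similarity matrix
    targets = torch.arange(p.size(0), device=p.device)
    # Pull matching patch-node pairs together and push mismatched pairs apart,
    # in both directions (patch -> node and node -> patch).
    return 0.5 * (F.cross_entropy(logits, targets) +
                  F.cross_entropy(logits.T, targets))

# Toy usage with random features standing in for the learned encoders.
patches = torch.randn(8, 256, requires_grad=True)
nodes = torch.randn(8, 256, requires_grad=True)
print(contrastive_loss(patches, nodes).item())
```

In practice the node embeddings would be produced by the multi-modal graph encoder (point clouds, images, attributes, structure, relationships) and the patch embeddings by an image backbone; the loss above only illustrates how the two are pulled into one matchable space.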