Core Concepts
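The proposed SeCG is a semantic-enhanced visual grounding model that uses graph attention to improve the understanding of descriptions that contain multiple referred objects.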
Summary
I. Abstract
3D visual grounding aims to locate the specified object in a 3D region based on textual descriptions.
Existing methods struggle with distinguishing similar objects, especially in complex referential relationships.
SeCG proposes a semantic-enhanced relational learning model using graph attention for better cross-modal alignment.
II. Introduction
Vision and language are crucial for computer understanding of real 3D scenes.
Multi-modal learning has led to the emergence of challenging tasks like 3D visual grounding.
The core challenge lies in perceiving referential relationships accurately.
III. Methods
A. Overview
Scene point clouds are segmented into objects for further processing.
The proposed model consists of semantic-enhanced visual encoding, relation graph learning, text encoding, and Transformer decoding.
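The segmentation step in the overview can be illustrated with a minimal sketch. It assumes per-point instance ids are already available (from ground truth or an off-the-shelf segmenter) and resamples each object to a fixed number of points so that a point backbone can process objects in batches. The function name `split_into_objects` and the `n_samples` parameter are hypothetical and are not taken from the paper.

```python
import numpy as np

def split_into_objects(points, instance_ids, n_samples=1024, seed=0):
    """Group scene points by instance id and resample each object to n_samples points.

    points:       (N, C) array of per-point features (e.g. xyz, or xyz + color).
    instance_ids: (N,) integer array assigning each point to an object.
    Returns a dict mapping instance id -> (n_samples, C) array.
    """
    rng = np.random.default_rng(seed)
    objects = {}
    for inst in np.unique(instance_ids):
        obj_pts = points[instance_ids == inst]
        # Sample with replacement only when the object has too few points.
        replace = len(obj_pts) < n_samples
        idx = rng.choice(len(obj_pts), size=n_samples, replace=replace)
        objects[int(inst)] = obj_pts[idx]
    return objects
```

Each resulting per-object array could then be passed to the visual encoder described in the next subsection.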
B. Semantic-enhanced Visual Encoding
PointNet++ is used as the backbone for encoding initial point clouds.
A semantic point cloud is generated to provide high-level semantics for better understanding of relationships.
C. Relation Graph Learning
A fully connected graph is constructed to learn implicit relationships among objects using a graph attention network.
Two sub-modules, Auxiliary Memory Unit and Multi-view Position Embedding, enhance the intrinsic attention algorithm.
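To make the idea of attention over a fully connected object graph concrete, here is a minimal single-head sketch in the style of a standard graph attention layer. It is not SeCG's implementation: the Auxiliary Memory Unit and Multi-view Position Embedding are not modeled, and the names `graph_attention`, `W`, `a`, and `slope` are assumptions introduced for illustration.

```python
import numpy as np

def graph_attention(h, W, a, slope=0.2):
    """Single-head graph attention over a fully connected graph (self-loops included).

    h: (N, F) object features, W: (F, Fp) projection, a: (2 * Fp,) attention vector.
    Returns the updated features (N, Fp) and the attention matrix (N, N).
    """
    z = h @ W                                  # project every object feature
    fp = z.shape[1]
    # a^T [z_i || z_j] splits into a term for node i plus a term for node j.
    src = z @ a[:fp]
    dst = z @ a[fp:]
    e = src[:, None] + dst[None, :]            # raw score for every ordered pair (i, j)
    e = np.where(e > 0, e, slope * e)          # LeakyReLU
    e = e - e.max(axis=1, keepdims=True)       # stabilize the softmax numerically
    alpha = np.exp(e)
    alpha /= alpha.sum(axis=1, keepdims=True)  # each row sums to 1
    return alpha @ z, alpha

rng = np.random.default_rng(0)
h = rng.normal(size=(5, 8))                    # 5 objects with 8-dim features
out, alpha = graph_attention(h, rng.normal(size=(8, 4)), rng.normal(size=8))
```

Because the graph is fully connected, no adjacency mask is needed: every object attends to every other object, and the learned weights in `alpha` play the role of implicit relationship strengths.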
IV. Experiment & Results
A. Datasets & Evaluation Metrics
Nr3D and Sr3D datasets are used for evaluation with different subsets based on complexity and viewpoint dependency.
On the ScanRefer dataset, localization accuracy is evaluated at fixed IoU thresholds and reported separately for unique and multiple samples.
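The IoU-based accuracy metric can be sketched as follows, assuming axis-aligned boxes given as `(xmin, ymin, zmin, xmax, ymax, zmax)`. The summary does not state the threshold values; 0.25 and 0.5 are used below as assumed, commonly reported settings. The helper names `box_iou_3d` and `accuracy_at_iou` are hypothetical.

```python
import numpy as np

def box_iou_3d(a, b):
    """IoU of two axis-aligned 3D boxes (xmin, ymin, zmin, xmax, ymax, zmax)."""
    a, b = np.asarray(a, dtype=float), np.asarray(b, dtype=float)
    lo = np.maximum(a[:3], b[:3])                 # lower corner of the overlap
    hi = np.minimum(a[3:], b[3:])                 # upper corner of the overlap
    inter = np.prod(np.clip(hi - lo, 0.0, None))  # zero if the boxes do not overlap
    vol_a = np.prod(a[3:] - a[:3])
    vol_b = np.prod(b[3:] - b[:3])
    return float(inter / (vol_a + vol_b - inter))

def accuracy_at_iou(preds, gts, threshold):
    """Fraction of predictions whose IoU with the ground-truth box meets the threshold."""
    hits = [box_iou_3d(p, g) >= threshold for p, g in zip(preds, gts)]
    return float(np.mean(hits))

preds = [(0, 0, 0, 2, 2, 2), (0, 0, 0, 1, 1, 1)]
gts = [(1, 0, 0, 3, 2, 2), (5, 5, 5, 6, 6, 6)]
acc_025 = accuracy_at_iou(preds, gts, 0.25)  # first pair has IoU 1/3, second has IoU 0
acc_050 = accuracy_at_iou(preds, gts, 0.50)
```

Reporting the same metric on the unique and multiple subsets amounts to calling `accuracy_at_iou` on each subset's predictions separately.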
B. Localization Results & Visualization
SeCG outperforms existing methods in overall accuracy on both the Nr3D and Sr3D datasets.
Adding 2D features improves localization accuracy but may hinder performance in scenarios with multiple same-class objects.
V. Conclusion
SeCG effectively addresses the challenge of weak understanding of multiple referred objects in 3D visual grounding tasks through semantic enhancement and relational learning using graph attention.