Core Concept
A novel interpretable framework for 3D visual grounding that predicts a chain of anchor objects to localize the final target object, enhancing performance and data efficiency.
Summary
The paper proposes a Chain-of-Thoughts (CoT) Data-Efficient 3D Visual Grounding framework, termed CoT3DRef, that formulates the 3D visual grounding problem as a sequence-to-sequence task. The key idea is to first predict a chain of anchor objects in a logical order, and then use this chain to localize the final target object.
The main components of the framework are:
Pathway Module: Extracts the anchor objects from the input utterance and predicts their logical order.
CoT Decoder: Takes the multi-modal features, the parallel localized objects, and the logical path as input, and sequentially localizes the anchors and the target object.
Pseudo Label Generator: Automatically generates pseudo-labels for the anchor objects and their logical order, without requiring any manual annotations.
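The interplay of the three components can be sketched as a toy pipeline. This is a minimal illustrative sketch, not the paper's implementation: the function names (`predict_logical_order`, `localize_chain`), the dictionary-based scene representation, and the reversed-mention-order heuristic are all assumptions made for illustration; the real framework operates on multi-modal features rather than class labels.

```python
# Hypothetical sketch of CoT-style sequential grounding (not CoT3DRef's code).

def predict_logical_order(utterance_anchors):
    """Toy stand-in for the Pathway module: return the mentioned objects
    in the order they should be localized. Here we simply reverse the
    mention order, since the target is typically mentioned first."""
    return list(reversed(utterance_anchors))

def localize_chain(scene_objects, ordered_labels):
    """Toy stand-in for the CoT decoder: localize each anchor in turn,
    conditioning each step on the objects found so far. A real decoder
    would fuse visual and language features instead of matching labels."""
    located = []
    for label in ordered_labels:
        for obj in scene_objects:
            if obj["class"] == label and obj not in located:
                located.append(obj)
                break
    return located

scene = [
    {"class": "window", "id": 0},
    {"class": "bookshelf", "id": 1},
    {"class": "chair", "id": 2},
]
# Utterance: "the chair next to the bookshelf near the window"
chain = predict_logical_order(["chair", "bookshelf", "window"])
result = localize_chain(scene, chain)
# The target is the last element of the localized chain.
target = result[-1]
```

The key property the sketch preserves is that the target is localized last, after its anchors, so every intermediate step is inspectable, which is the source of the framework's interpretability.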
The proposed framework is data-efficient and can be easily integrated into existing 3D visual grounding architectures. Experiments on the Nr3D, Sr3D, and ScanRefer datasets show that CoT3DRef outperforms state-of-the-art methods, especially when trained on limited data. For example, on the Sr3D dataset, the framework trained on only 10% of the data matches the performance of existing methods trained on the full dataset.
The key advantages of the CoT3DRef framework are its interpretability, data efficiency, and potential to mimic the human perception system. By decomposing the referring task into a sequence of interpretable steps, the framework provides insights into how the model arrives at its final decision, which can help identify and address potential biases or errors.
Statistics
To reach the chair target, we first have to localize the white and red boxes, then the bookshelf.
Training on only 10% of the data is enough to beat all the baselines, which are trained on the entire dataset.
Quotes
"Can we design an interpretable 3D visual grounding framework that has the potential to mimic the human perception system?"
"Understanding the CoT is crucial for several reasons. Firstly, it helps explain how the model arrived at its decision, which is essential for transparency and interpretability. Secondly, it helps identify potential biases or errors in the model, which can be addressed to improve its accuracy and reliability. Third, it is a critical step toward intelligent systems that mimic human perception."