Chain-of-Thoughts Data-Efficient 3D Visual Grounding Framework

Core Concepts
A novel interpretable framework for 3D visual grounding that predicts a chain of anchor objects to localize the final target object, enhancing performance and data efficiency.
The paper proposes a Chain-of-Thoughts (CoT) Data-Efficient 3D Visual Grounding framework, termed CoT3DRef, that formulates 3D visual grounding as a sequence-to-sequence task. The key idea is to first predict a chain of anchor objects in a logical order, and then use this chain to localize the final target object.

The main components of the framework are:
- Pathway Module: extracts the anchor objects from the input utterance and predicts their logical order.
- CoT Decoder: takes the multi-modal features, the parallel localized objects, and the logical path as input, and sequentially localizes the anchors and then the target object.
- Pseudo-Label Generator: automatically generates pseudo-labels for the anchor objects and their logical order, without requiring any manual annotations.

The framework is data-efficient and can be easily integrated into existing 3D visual grounding architectures. Experiments on the Nr3D, Sr3D, and ScanRefer datasets show that CoT3DRef outperforms state-of-the-art methods, especially when trained on limited data. For example, on the Sr3D dataset, the framework trained on only 10% of the data matches the performance of existing methods trained on the full dataset.

The key advantages of CoT3DRef are its interpretability, data efficiency, and potential to mimic the human perception system. By decomposing the referring task into a sequence of interpretable steps, the framework provides insight into how the model arrives at its final decision, which helps identify and address potential biases or errors.
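The sequential "chain of anchors" idea can be illustrated with a minimal sketch. The names and the toy matcher below are hypothetical stand-ins: the real CoT decoder scores multi-modal features rather than matching object names, but the control flow (localize each anchor in logical order, then the target, conditioning on what has been found so far) is the same.

```python
from dataclasses import dataclass

@dataclass
class SceneObject:
    name: str
    box: tuple  # (x, y, z) center of the 3D bounding box

def localize_chain(scene, ordered_anchors, target):
    """Sequentially localize each anchor, then the target,
    accumulating the objects found so far (a stand-in for the
    CoT decoder attending over previously localized anchors)."""
    located = []
    for name in ordered_anchors + [target]:
        # toy matcher: pick the scene object whose name matches;
        # the real decoder scores fused vision-language features
        match = next(o for o in scene if o.name == name)
        located.append(match)
    return located

scene = [SceneObject("bookshelf", (1.0, 0.5, 0.0)),
         SceneObject("box", (2.0, 0.5, 0.0)),
         SceneObject("chair", (2.5, 0.5, 0.0))]

# "the chair next to the box, near the bookshelf"
chain = localize_chain(scene, ["box", "bookshelf"], "chair")
# the last element of the chain is the target object
```

The target is always the final step of the chain, so the model's intermediate decisions can be inspected one anchor at a time.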
To reach the chair target, we first have to localize the white and red boxes, then the bookshelf. Training on only 10% of the data is enough to beat all the baselines, which are trained on the full dataset.
"Can we design an interpretable 3D visual grounding framework that has the potential to mimic the human perception system?" "Understanding the CoT is crucial for several reasons. Firstly, it helps explain how the model arrived at its decision, which is essential for transparency and interpretability. Secondly, it helps identify potential biases or errors in the model, which can be addressed to improve its accuracy and reliability. Third, it is a critical step toward intelligent systems that mimic human perception."

Key Insights Distilled From

by Eslam Mohame... at 04-23-2024
CoT3DRef: Chain-of-Thoughts Data-Efficient 3D Visual Grounding

Deeper Inquiries

How can the Pathway module be extended to handle multiple valid logical paths for the anchor objects?

The Pathway module in the CoT3DRef framework is responsible for generating a logical order for the objects extracted from the input utterance. To handle multiple valid logical paths for the anchor objects, the Pathway module can be extended in the following ways:
- Graph representation: instead of assuming a linear sequence of objects, represent the relationships between objects as a graph. Each object becomes a node, and the relationships between objects become edges. This graph structure can capture multiple valid paths that lead to the target object.
- Probabilistic modeling: introduce a probabilistic model that assigns probabilities to different paths based on the likelihood of each path leading to the target object. By incorporating uncertainty into pathway generation, the module can explore and evaluate multiple valid paths.
- Attention mechanism: implement an attention mechanism that dynamically focuses on different objects in the utterance based on their relevance to the target object. This mechanism can adaptively select and weigh different paths, allowing the model to consider multiple valid logical paths.
- Hierarchical pathway generation: divide pathway generation into hierarchical levels, where each level corresponds to a different aspect of the logical order. Organizing the process hierarchically lets the module handle multiple valid paths at different levels of abstraction.

By incorporating these extensions, the Pathway module can effectively handle multiple valid logical paths for the anchor objects, enabling the CoT3DRef framework to capture the complexity and variability of natural language instructions in 3D visual grounding tasks.
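The graph-representation idea can be sketched concretely: with objects as nodes and relations as directed edges, enumerating every acyclic route to the target surfaces all valid anchor chains. The relation graph below is a hypothetical example, not the paper's data structure.

```python
def all_paths(graph, start, target, path=None):
    """Enumerate every anchor path from a starting object to the
    target in a relation graph (nodes: objects, edges: relations)."""
    path = (path or []) + [start]
    if start == target:
        return [path]
    paths = []
    for nxt in graph.get(start, []):
        if nxt not in path:  # skip cycles so recursion terminates
            paths.extend(all_paths(graph, nxt, target, path))
    return paths

# hypothetical relation graph for an utterance like
# "the chair near the box that is next to the bookshelf"
graph = {"box": ["bookshelf", "chair"], "bookshelf": ["chair"]}
paths = all_paths(graph, "box", "chair")
# two valid logical paths reach the target:
# ['box', 'chair'] and ['box', 'bookshelf', 'chair']
```

A probabilistic extension could then score each enumerated path and either pick the most likely one or train against all of them.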

How can the potential limitations of the pseudo-label generation module be improved to further boost performance on datasets like Nr3D?

The pseudo-label generation module in the CoT3DRef framework plays a crucial role in providing inexpensive guidance for learning efficiency without requiring manual annotations. To address potential limitations and further boost performance on datasets like Nr3D, the following improvements can be implemented:
- Enhanced anchors parser: incorporate more sophisticated natural language processing techniques, such as pre-trained language models like BERT or GPT, to better extract and match objects mentioned in the utterance to the scene objects. This can improve the accuracy of anchor extraction and alignment.
- Refinement of logical-order prediction: enhance the pathway extraction process with more advanced algorithms or models that better predict the logical order of objects in the utterance, producing more accurate and reliable chains of thoughts for the referring task.
- Quality assurance mechanisms: validate the accuracy of the generated pseudo-labels, for example through human-in-the-loop validation or automated checks on the extracted anchors, logical order, and object localization information.
- Adaptive learning: allow the model to iteratively improve the quality of pseudo-labels during training, updating them based on the model's performance and refining the annotations over time through feedback loops.
- Domain adaptation: fine-tune the pseudo-label generation module on specific datasets like Nr3D to adapt to the characteristics and nuances of the data, training it on a diverse range of scenes and instructions to improve generalization.
By implementing these improvements, the pseudo-label generation module can overcome limitations and enhance the performance of the CoT3DRef framework on datasets like Nr3D, leading to more accurate and efficient 3D visual grounding results.
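A minimal sketch of what an anchors parser does, under simplifying assumptions: here the "parser" is just keyword matching of scene class names against the utterance, with first-mention position as a cheap proxy for logical order. The real module (and the proposed BERT/GPT-based enhancement) would resolve synonyms and grammatical structure rather than exact strings.

```python
import re

def pseudo_labels(utterance, scene_classes, target):
    """Toy anchors parser: treat every scene class mentioned in the
    utterance (other than the target) as an anchor, ordered by the
    position of its first mention -- pseudo-labels produced with no
    manual annotation."""
    mentions = []
    for cls in scene_classes:
        m = re.search(r"\b" + re.escape(cls) + r"\b", utterance.lower())
        if m and cls != target:
            mentions.append((m.start(), cls))
    # sort by mention position to approximate the logical order
    return [cls for _, cls in sorted(mentions)]

anchors = pseudo_labels(
    "the chair between the box and the bookshelf",
    ["chair", "box", "bookshelf", "table"],
    target="chair")
# → ['box', 'bookshelf']
```

The quality-assurance improvements above amount to checking such automatically extracted labels (here, `anchors`) against the scene before they are used as supervision.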

How can the CoT3DRef framework be adapted to other vision-language tasks beyond 3D visual grounding, such as image captioning or visual question answering?

The CoT3DRef framework's design principles and architecture can be adapted to other vision-language tasks beyond 3D visual grounding, such as image captioning or visual question answering, by considering the following strategies:
- Task formulation: modify the input and output structures of the framework to align with the requirements of the specific task. For image captioning, the input can be an image instead of a 3D scene, and the output a descriptive caption; for visual question answering, the input can be an image and a question, with the output being the answer.
- Data representation: adjust the data representation and encoding mechanisms to accommodate the modalities and information sources relevant to the new task, for example by integrating pre-trained models for image understanding or question processing.
- Loss functions: customize the loss functions and optimization objectives to suit the new task, such as BLEU score for captioning or accuracy for question answering.
- Model architecture: add components specific to image captioning or visual question answering, such as modules for language generation, answer prediction, or context understanding.
- Fine-tuning and transfer learning: fine-tune the pre-trained CoT3DRef model on datasets specific to the new task, leveraging knowledge learned from 3D visual grounding via transfer learning.

By adapting the CoT3DRef framework in these ways, it can be effectively repurposed for a variety of vision-language tasks, demonstrating its versatility and applicability across different domains in computer vision and natural language processing.
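The task-formulation point can be made concrete with a generic, entirely hypothetical sketch: the chain-of-anchors recipe is a decompose-then-solve loop, so porting it to a task like VQA mainly means swapping in a task-specific decomposer and solver while keeping the sequential conditioning.

```python
def cot_answer(question, decompose, solve):
    """Generic chain-of-thoughts wrapper: decompose the query into
    intermediate steps (the 'anchors'), solve them in order, then
    answer conditioned on the accumulated intermediate results."""
    context = {}
    for step in decompose(question):
        context[step] = solve(step, context)
    return solve(question, context)

# hypothetical VQA-style usage: find the relevant objects first,
# then count them, then answer from the accumulated context
steps = lambda q: ["find red objects", "count them"]

def solve(step, ctx):
    facts = {"find red objects": ["ball", "cube"],
             "count them": 2}
    # the final question is answered from the intermediate results
    return facts.get(step, ctx.get("count them"))

answer = cot_answer("how many red objects?", steps, solve)
```

Only `decompose` and `solve` are task-specific; the wrapper itself is unchanged from the 3D grounding setting, which is what makes the transfer plausible.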