The paper introduces SAGE, a comprehensive framework for language-guided manipulation of articulated objects. The key insight is to model the cross-category commonalities of part semantics and of part actions separately, bridging the potential mismatch between the two.
The framework consists of the following main components, which compose into a single perception-planning-execution pipeline (see the sketch after this list):
Part-aware Scene Perception: A scene description is generated by fusing the outputs of a general-purpose Visual Language Model (VLM) and a domain-specific 3D part perception model. This provides rich context and accurate part-level information.
Instruction Interpretation and Global Planner: The natural language instruction is interpreted by a large language model (GPT-4V) and translated into executable action programs defined on the semantic parts. A global planner tracks the execution progress and adjusts the strategy when necessary.
Part Grounding and Execution: The semantic parts are mapped to Generalizable and Actionable Parts (GAParts) by a part grounding module. Executable trajectories are then generated on the GAParts to carry out the manipulation.
Interactive Feedback: An interactive perception module leverages observations during execution to refine the part state estimation and update the global planner, improving the overall robustness.
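To make the interplay of these components concrete, below is a minimal Python sketch of how they might compose into a perceive-plan-ground-execute loop with feedback-driven replanning. It is an illustration under assumptions, not the authors' implementation: every class, function, and label name (`SemanticPart`, `ActionStep`, `ground_to_gapart`, `hinge_door`, etc.) is a hypothetical placeholder, and the component bodies are stubs standing in for the VLM, LLM, grounding, and execution modules described above.

```python
# Hypothetical sketch of how SAGE-style components could compose.
# All names are placeholders invented for illustration, not the paper's API.
from dataclasses import dataclass, field
from typing import Dict, List


@dataclass
class SemanticPart:
    """A part identified by its semantics (e.g. 'microwave door')."""
    name: str
    gapart_label: str = ""                    # filled in by part grounding
    state: Dict[str, float] = field(default_factory=dict)


@dataclass
class ActionStep:
    """One step of the action program: a primitive applied to a semantic part."""
    part: SemanticPart
    primitive: str                            # e.g. "pull_open", "press"
    done: bool = False


def perceive_scene() -> List[SemanticPart]:
    # Stand-in for fusing a general VLM description with a 3D part-perception model.
    return [SemanticPart(name="microwave door")]


def interpret_instruction(instruction: str, parts: List[SemanticPart]) -> List[ActionStep]:
    # Stand-in for the LLM translating the instruction into an action program.
    return [ActionStep(part=p, primitive="pull_open") for p in parts]


def ground_to_gapart(step: ActionStep) -> ActionStep:
    # Stand-in for mapping the semantic part to a generalizable actionable part.
    step.part.gapart_label = "hinge_door"
    return step


def execute_and_observe(step: ActionStep) -> Dict[str, float]:
    # Stand-in for trajectory execution plus interactive perception of the outcome.
    return {"success": 1.0, "joint_angle": 1.2}


def run_episode(instruction: str, max_replans: int = 3) -> bool:
    parts = perceive_scene()
    plan = interpret_instruction(instruction, parts)
    for _ in range(max_replans):
        for step in plan:
            step = ground_to_gapart(step)
            feedback = execute_and_observe(step)
            step.done = feedback.get("success", 0.0) > 0.5
            step.part.state.update(feedback)  # interactive feedback refines part state
        if all(s.done for s in plan):
            return True
        # Global planner: re-plan over the parts whose steps have not yet succeeded.
        plan = interpret_instruction(instruction, [s.part for s in plan if not s.done])
    return False


if __name__ == "__main__":
    print(run_episode("open the microwave"))
```

The sketch mirrors the division of labor described above: semantic parts carry the language-facing interface, GAPart labels carry the action-facing one, and the feedback from execution is the channel through which the global planner updates its strategy.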
The proposed method is evaluated extensively in simulation environments covering diverse articulated objects and tasks. It also demonstrates strong performance in real-world robot experiments, showcasing its ability to handle complex, language-guided manipulation challenges.