Core Concepts
Enhancing multimodal alignment through touch-language-vision datasets.
Summary
Tactility enhances both human and robotic perception. The Touch-Language-Vision (TLV) dataset aligns touch, language, and vision for semantic understanding, and TLV-Link is a training framework fine-tuned on it with minimal parameter adjustment. Multimodal alignment is crucial for advances in robotics and AI. Vision-based tactile sensors such as GelSight capture detailed contact information, but existing tactile datasets lack rich textual descriptions, which hinders cross-modal alignment. TLV bridges this gap with sentence-level descriptions for 20,000 synchronized tactile and visual observations. Trained on TLV, TLV-Link shows promise on tactile classification tasks, with significant performance improvements.
Statistics
The new dataset features sentence-level descriptions for multimodal alignment.
TLV-Link achieves effective semantic alignment with minimal parameter adjustment (about 1% of model parameters).
TLV incorporates three modalities: touch, language, and vision.
The TLV dataset provides annotated text descriptions for 20,000 pairs of synchronized tactile and visual observations.
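The exact training objective of TLV-Link is not detailed here. As a rough illustration only, cross-modal alignment of paired observations (e.g. touch-vision or touch-language embeddings) is commonly trained with a symmetric CLIP-style contrastive loss; the sketch below implements that loss in pure Python on toy embeddings, and all names are illustrative assumptions, not the authors' code.

```python
import math

def l2_normalize(v):
    """Scale a vector to unit length (cosine similarity becomes a dot product)."""
    norm = math.sqrt(sum(x * x for x in v)) or 1.0
    return [x / norm for x in v]

def contrastive_loss(anchors, positives, temperature=0.07):
    """Symmetric InfoNCE-style loss over a batch of paired embeddings.

    anchors[i] and positives[i] come from the same observation (e.g. the
    tactile and visual embeddings of one touch event); all other pairings
    in the batch act as negatives.
    """
    a = [l2_normalize(v) for v in anchors]
    p = [l2_normalize(v) for v in positives]
    # Similarity matrix: sim[i][j] = cosine(a_i, p_j) / temperature.
    sim = [[sum(x * y for x, y in zip(ai, pj)) / temperature for pj in p]
           for ai in a]
    n = len(a)
    loss = 0.0
    for i in range(n):
        row = sim[i]                          # anchor i vs. all positives
        col = [sim[k][i] for k in range(n)]   # positive i vs. all anchors
        # Cross-entropy with the matching index i as the target, both directions.
        loss += -row[i] + math.log(sum(math.exp(s) for s in row))
        loss += -col[i] + math.log(sum(math.exp(s) for s in col))
    return loss / (2 * n)

# Toy usage: correctly paired embeddings score a lower loss than shuffled ones.
touch = [[1.0, 0.0], [0.0, 1.0]]
vision_matched = [[0.9, 0.1], [0.1, 0.9]]
vision_shuffled = [[0.1, 0.9], [0.9, 0.1]]
print(contrastive_loss(touch, vision_matched) < contrastive_loss(touch, vision_shuffled))
```

In a parameter-efficient setup like the one described (roughly 1% of parameters adjusted), this loss would be backpropagated only through a small trainable projection head or adapter while the pretrained encoders stay frozen.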
Quotes
"Nevertheless, the multimodal research related to touch primarily focuses on visual and tactile modalities."
"Tactility allows us to perceive the texture, temperature, and hardness of objects etc."
"We construct a touch-language-vision dataset named TLV by human-machine cascade collaboration."
"TLV incorporates three modalities: touch, language, and vision."
"To assess TLV’s efficacy, we employ it as the training dataset."