The article introduces Hulk, a generalist human-centric perceiver capable of handling various tasks without task-specific adaptation. It addresses the challenges in unifying 2D vision, 3D vision, skeleton-based, and vision-language tasks by condensing inputs and outputs into four modalities. The model architecture includes tokenizers and de-tokenizers for semantic words and locations, with an encoder-decoder framework guided by modality-specific indicators. Objective functions include semantic contrastive loss and digit regression loss. Training on diverse datasets covering eight human-centric tasks demonstrates Hulk's superior performance.
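The two objective functions mentioned above can be illustrated with a small sketch. The function names, signatures, and the exact loss formulations below are hypothetical simplifications (an InfoNCE-style contrastive loss and a smooth-L1 regression), not the paper's verbatim definitions, but they convey how discrete semantic-word outputs and continuous location outputs are supervised differently:

```python
import numpy as np

def semantic_contrastive_loss(pred, word_embeds, target_idx, tau=0.07):
    """InfoNCE-style loss (illustrative): pull the predicted token feature
    toward the embedding of the correct semantic word, push it away from
    the rest of the vocabulary."""
    pred = pred / np.linalg.norm(pred)
    w = word_embeds / np.linalg.norm(word_embeds, axis=1, keepdims=True)
    logits = w @ pred / tau          # cosine similarity scaled by temperature
    logits -= logits.max()           # numerical stability before softmax
    probs = np.exp(logits) / np.exp(logits).sum()
    return -np.log(probs[target_idx])

def digit_regression_loss(pred_coords, gt_coords):
    """Smooth-L1 regression (illustrative) over continuous location
    outputs such as keypoint or box coordinates."""
    diff = np.abs(pred_coords - gt_coords)
    return np.where(diff < 1.0, 0.5 * diff**2, diff - 0.5).mean()
```

Predicting a semantic word thus reduces to classification over a shared word-embedding vocabulary, while locations are regressed directly as digits, which is what lets one decoder head serve heterogeneous human-centric tasks.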
Key insights distilled from arxiv.org, by Yizhou Wang, ..., 03-25-2024.
https://arxiv.org/pdf/2312.01697.pdf