The article introduces Hulk, a generalist human-centric perceiver capable of handling various tasks without task-specific adaptation. It addresses the challenges in unifying 2D vision, 3D vision, skeleton-based, and vision-language tasks by condensing inputs and outputs into four modalities. The model architecture includes tokenizers and de-tokenizers for semantic words and locations, with an encoder-decoder framework guided by modality-specific indicators. Objective functions include semantic contrastive loss and digit regression loss. Training on diverse datasets covering eight human-centric tasks demonstrates Hulk's superior performance.
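The two objective functions named above can be illustrated with a minimal sketch. It assumes the semantic contrastive loss is a softmax cross-entropy over cosine similarities between a predicted feature and a small vocabulary of word embeddings, and that the digit regression loss is a mean absolute error on normalized coordinates. The function names, the temperature value, and the toy vectors are hypothetical and are not taken from the Hulk code base.

```python
import math


def cosine(u, v):
    """Cosine similarity between two equal-length vectors."""
    dot = sum(a * b for a, b in zip(u, v))
    norm_u = math.sqrt(sum(a * a for a in u))
    norm_v = math.sqrt(sum(b * b for b in v))
    return dot / (norm_u * norm_v)


def semantic_contrastive_loss(pred, vocab, target, temperature=0.07):
    """Cross-entropy over similarities to every vocabulary embedding.

    pred:   predicted semantic feature (list of floats)
    vocab:  list of word embeddings (assumed fixed, e.g. from a text encoder)
    target: index of the correct word in vocab
    temperature: hypothetical scaling value, not from the source
    """
    logits = [cosine(pred, word) / temperature for word in vocab]
    # Log-sum-exp with max subtraction for numerical stability.
    top = max(logits)
    log_z = top + math.log(sum(math.exp(x - top) for x in logits))
    return log_z - logits[target]


def digit_regression_loss(pred_coords, true_coords):
    """Mean absolute error between predicted and true coordinates."""
    diffs = [abs(p - t) for p, t in zip(pred_coords, true_coords)]
    return sum(diffs) / len(diffs)


# Toy usage: two "words" with orthogonal embeddings.
vocab = [[1.0, 0.0], [0.0, 1.0]]
pred = [0.9, 0.1]
loss_sem = semantic_contrastive_loss(pred, vocab, target=0)
loss_loc = digit_regression_loss([0.0, 1.0], [1.0, 1.0])
total = loss_sem + loss_loc  # equal weighting is an assumption
```

In this sketch the semantic term is small when the predicted feature points toward the correct word embedding, and the location term grows linearly with coordinate error. How the two terms are weighted in practice is not stated in the summary.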
by Yizhou Wang,... at arxiv.org 03-25-2024
https://arxiv.org/pdf/2312.01697.pdf