The article introduces Hulk, a generalist human-centric perceiver capable of handling various tasks without task-specific adaptation. It addresses the challenges in unifying 2D vision, 3D vision, skeleton-based, and vision-language tasks by condensing inputs and outputs into four modalities. The model architecture includes tokenizers and de-tokenizers for semantic words and locations, with an encoder-decoder framework guided by modality-specific indicators. Objective functions include semantic contrastive loss and digit regression loss. Training on diverse datasets covering eight human-centric tasks demonstrates Hulk's superior performance.
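The summary names two objective functions, a semantic contrastive loss and a digit regression loss, without giving their exact form. Below is a minimal sketch of how such a combined objective is commonly built: an InfoNCE-style contrastive term over similarity scores plus an L1 regression term over predicted location values. The function names, the temperature value, and the loss weighting are assumptions for illustration, not the paper's actual implementation.

```python
import math

def semantic_contrastive_loss(sim_pos, sim_negs, temperature=0.07):
    """InfoNCE-style contrastive loss (assumed form): pull the positive
    pair's similarity above the negatives'. Inputs are raw similarity
    scores; temperature 0.07 is a common default, not from the paper."""
    logits = [sim_pos / temperature] + [s / temperature for s in sim_negs]
    m = max(logits)  # subtract max for numerical stability
    log_denom = m + math.log(sum(math.exp(l - m) for l in logits))
    return -(sim_pos / temperature - log_denom)

def digit_regression_loss(pred, target):
    """Assumed L1 regression over predicted continuous location values
    (e.g. coordinates decoded by the location de-tokenizer)."""
    return sum(abs(p - t) for p, t in zip(pred, target)) / len(pred)

def total_loss(sim_pos, sim_negs, pred, target, reg_weight=1.0):
    """Hypothetical combined objective: contrastive + weighted regression."""
    return (semantic_contrastive_loss(sim_pos, sim_negs)
            + reg_weight * digit_regression_loss(pred, target))
```

A perfect location prediction drives the regression term to zero, while a positive similarity well above all negatives drives the contrastive term toward zero, so the total loss decreases as both heads improve.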
Key insights extracted from arxiv.org
by Yizhou Wang et al., arxiv.org, 03-25-2024
https://arxiv.org/pdf/2312.01697.pdf