Core Concept
Hulk is a versatile model that unifies diverse human-centric tasks without task-specific finetuning.
Summary
The article introduces Hulk, a generalist human-centric perceiver that handles diverse tasks without task-specific adaptation. It addresses the challenge of unifying 2D vision, 3D vision, skeleton-based, and vision-language tasks by condensing their inputs and outputs into four modalities built from two basic token formats: semantic words and locations. The architecture pairs tokenizers and de-tokenizers for these two formats with an encoder-decoder framework guided by modality-specific indicators. Training combines a semantic contrastive loss with a digit regression loss. Trained on diverse datasets covering eight human-centric tasks, Hulk outperforms prior specialist models.
- Introduction to Hulk: Presents Hulk as a universal knowledge translator.
- Model Architecture: Details the design of the tokenizers, de-tokenizers, and the encoder-decoder framework guided by modality-specific indicators (see the first sketch after this list).
- Objective Functions: Explains the semantic contrastive loss and the digit regression loss (see the second sketch after this list).
- Training Datasets: Lists datasets used for training Hulk.
- Evaluation Datasets: Mentions benchmark datasets for evaluating Hulk's performance.
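To make the architecture concrete, here is a minimal PyTorch sketch of the unified I/O design. It is an illustration under stated assumptions, not the authors' released code: `UnifiedPerceiver`, the dimensions, and the module names are all hypothetical. The idea it mirrors is that tokenizers map semantic words and locations into one shared token space, a modality-specific indicator embedding steers a single shared encoder-decoder, and two general de-tokenizer heads replace task-specific heads.

```python
# Hedged sketch of Hulk-style unified I/O (hypothetical names/dims,
# not the authors' code). Inputs/outputs reduce to two token formats:
# semantic words (discrete) and locations (continuous), routed through
# one shared encoder-decoder by a modality-specific indicator token.
import torch
import torch.nn as nn

class UnifiedPerceiver(nn.Module):
    def __init__(self, d_model=256, vocab=1000, n_modalities=4):
        super().__init__()
        # Tokenizers: map each input format into the shared token space.
        self.word_tokenizer = nn.Embedding(vocab, d_model)  # semantic words
        self.loc_tokenizer = nn.Linear(2, d_model)          # (x, y) locations
        # Modality-specific indicators tell the decoder which output
        # format (e.g., words vs. locations) to produce.
        self.indicator = nn.Embedding(n_modalities, d_model)
        self.transformer = nn.Transformer(
            d_model=d_model, nhead=8, num_encoder_layers=2,
            num_decoder_layers=2, batch_first=True)
        # De-tokenizers: two general heads instead of task-specific ones.
        self.word_head = nn.Linear(d_model, d_model)  # shared text-embedding space
        self.loc_head = nn.Linear(d_model, 2)         # coordinate regression

    def forward(self, word_ids, locs, modality_id, n_queries=16):
        # Encode both token streams jointly.
        src = torch.cat([self.word_tokenizer(word_ids),
                         self.loc_tokenizer(locs)], dim=1)
        # Decoder queries carry the modality indicator.
        b = src.size(0)
        queries = self.indicator(modality_id).unsqueeze(1).expand(b, n_queries, -1)
        h = self.transformer(src, queries)
        return self.word_head(h), self.loc_head(h)

model = UnifiedPerceiver()
words = torch.randint(0, 1000, (2, 5))  # toy semantic-word tokens
locs = torch.rand(2, 4, 2)              # toy normalized keypoint locations
sem, xy = model(words, locs, torch.tensor([1, 1]))
print(sem.shape, xy.shape)              # (2, 16, 256) and (2, 16, 2)
```

The design choice worth noting is that adding a new task changes only the indicator and the data, not the network: every task reuses the same two heads.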
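The two objectives can be sketched the same way, again as an assumption-laden illustration rather than the paper's exact formulation (temperature, normalization, and target construction may differ): the semantic contrastive loss pulls each predicted token toward the embedding of its ground-truth word in InfoNCE style, while the digit regression loss is a plain L1 penalty on continuous outputs such as coordinates.

```python
# Hedged sketch of the two training objectives; tensor names are
# hypothetical and details may differ from the paper.
import torch
import torch.nn.functional as F

def semantic_contrastive_loss(pred, word_emb, target_ids, tau=0.07):
    """InfoNCE-style loss: each predicted token should be closest to
    the embedding of its ground-truth semantic word.
    pred:       (N, D) predicted semantic features
    word_emb:   (V, D) embedding table of candidate words
    target_ids: (N,)   index of the correct word per prediction
    """
    pred = F.normalize(pred, dim=-1)
    word_emb = F.normalize(word_emb, dim=-1)
    logits = pred @ word_emb.t() / tau  # (N, V) scaled cosine similarities
    return F.cross_entropy(logits, target_ids)

def digit_regression_loss(pred_xy, gt_xy):
    """L1 regression on continuous outputs such as keypoint or box
    coordinates (the 'digit' targets)."""
    return F.l1_loss(pred_xy, gt_xy)

# Toy usage with random tensors.
pred = torch.randn(8, 256)
table = torch.randn(1000, 256)
ids = torch.randint(0, 1000, (8,))
loss = semantic_contrastive_loss(pred, table, ids) \
     + digit_regression_loss(torch.rand(8, 2), torch.rand(8, 2))
print(loss.item())
```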
Key Statistics
- CrowdHuman Detection: MR⁻² = 0.7%
- COCO 2D Pose Estimation: AP = 85.8%
- RAPv2 Attribute Recognition: mA = 71.3%
Quotes
"Hulk is the first multimodal human-centric generalist model."
"Hulk outperforms current leading specialist models."
"Hulk simplifies input-output heterogeneity into two basic formats."