Key Concept
Chameleon is a versatile and data-efficient generalist model that can flexibly adapt to a wide range of unseen dense visual prediction tasks using only a small number of labeled examples.
Abstract
The paper introduces Chameleon, a data-efficient generalist model for diverse dense visual prediction tasks. Key highlights:
- Chameleon is designed to be a versatile model that can adapt to arbitrary dense prediction tasks with unique input modalities, output structures, and semantics, using only a small number of labeled examples (dozens).
- The model is based on the Visual Token Matching (VTM) framework, with several improvements to enhance its performance and versatility:
  - A flexible encoding mechanism to handle variable multi-modal inputs.
  - A task-adaptive feature re-weighting module in the hierarchical architecture to better associate image and label features.
  - Scaling up the model capacity and resolution, as well as meta-training on a large-scale diverse dataset.
- Chameleon is evaluated on six downstream benchmarks covering a wide range of real-world scenarios, including video, 3D, medical, biological, and user-interactive tasks. It significantly outperforms existing generalist baselines, demonstrating its effectiveness in low-shot learning of diverse dense visual prediction tasks.
- The paper's analyses suggest that the key factors for Chameleon's success are the effective encoding mechanism, flexible adaptation, and meta-training on a rich dataset.
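The token-matching idea behind the VTM framework mentioned above can be illustrated with a small sketch. This is not the authors' implementation: it assumes that each query image token is compared against the support image tokens by dot-product similarity, and that the predicted label token is the similarity-weighted average of the corresponding support label tokens. The function name `match_tokens`, the `temperature` parameter, and the toy vectors are hypothetical and chosen only for illustration.

```python
import math


def softmax(scores):
    """Numerically stable softmax over a list of floats."""
    m = max(scores)
    exps = [math.exp(s - m) for s in scores]
    total = sum(exps)
    return [e / total for e in exps]


def dot(a, b):
    """Dot product of two equal-length vectors."""
    return sum(x * y for x, y in zip(a, b))


def match_tokens(query_img, support_img, support_lbl, temperature=1.0):
    """Predict one label token per query image token (illustrative only).

    query_img:   list of query image token vectors
    support_img: list of support image token vectors
    support_lbl: list of support label token vectors, aligned with support_img
    """
    dim = len(support_lbl[0])
    predictions = []
    for q in query_img:
        # Similarity of this query token to every support image token.
        scores = [dot(q, k) / temperature for k in support_img]
        weights = softmax(scores)
        # Weighted average of the support label tokens.
        pred = [
            sum(w * lbl[d] for w, lbl in zip(weights, support_lbl))
            for d in range(dim)
        ]
        predictions.append(pred)
    return predictions


# Toy example: two support tokens with labels 1.0 and 0.0.
support_img = [[10.0, 0.0], [0.0, 10.0]]
support_lbl = [[1.0], [0.0]]
close_to_first = match_tokens([[10.0, 0.0]], support_img, support_lbl)
equidistant = match_tokens([[1.0, 1.0]], support_img, support_lbl)
```

In this toy setup, a query token that matches the first support token inherits its label almost exactly, while a query token equally similar to both receives the average of the two labels. In the actual model the tokens come from learned image and label encoders, and the matching is performed at several feature levels.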
Statistics
Chameleon achieves 67.2% AP on animal keypoint detection, 85.2% ADD on 6D pose estimation, 88.5% F1 on skin lesion segmentation, 77.5% J&F on video object segmentation, 12.0 MAE on object counting, and 70.3% AP50 on cell instance segmentation.
Chameleon is fine-tuned with at most 50 labeled examples per task; DAVIS 2017 and ISIC 2018 use fewer (1-shot and 20-shot, respectively).
Quotes
"Chameleon successfully adapts to each scenario using at most 50 labeled examples per task, significantly outperforming the generalist baselines."
"Our extensive analyses also suggest that effective encoding mechanism with flexible adaptation and meta-training on a rich dataset are the key factors of successful generalization to out-of-distribution tasks."