The paper introduces Chameleon, a data-efficient generalist model for diverse dense visual prediction tasks. Key highlights:
- Chameleon is designed as a versatile model that can adapt to arbitrary dense prediction tasks with unique input modalities, output structures, and semantics, using only a small number of labeled examples (on the order of dozens).
- The model builds on the Visual Token Matching (VTM) framework, with several improvements that enhance its performance and versatility (see the sketch after this list for the core matching idea).
- Chameleon is evaluated on six downstream benchmarks covering a wide range of real-world scenarios, including video, 3D, medical, biological, and user-interactive tasks. It significantly outperforms existing generalist baselines, demonstrating its effectiveness in low-shot learning of diverse dense visual prediction tasks.
- The paper's analyses suggest that the key factors behind Chameleon's success are its effective encoding mechanism, flexible adaptation, and meta-training on a rich dataset.
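For readers unfamiliar with VTM, its core idea is non-parametric matching: label tokens for a query image are predicted as a similarity-weighted combination of the label tokens of the few labeled support examples, where similarity is computed between image tokens. The snippet below is a minimal illustrative sketch of that matching step in PyTorch; the function name `token_matching`, the tensor shapes, the cosine-similarity attention, and the `temperature` parameter are assumptions made for illustration, not the actual Chameleon or VTM implementation (which performs this matching with multi-head attention inside a hierarchical encoder-decoder).

```python
import torch
import torch.nn.functional as F

def token_matching(query_img_tokens, support_img_tokens, support_label_tokens,
                   temperature=0.1):
    """Simplified sketch of VTM-style token matching.

    query_img_tokens:     (Nq, D) image tokens of the query image
    support_img_tokens:   (Ns, D) image tokens of the labeled support images
    support_label_tokens: (Ns, D) label tokens encoded from the support labels

    Each query label token is a similarity-weighted sum of support label
    tokens, with weights derived from image-token similarity.
    """
    q = F.normalize(query_img_tokens, dim=-1)
    k = F.normalize(support_img_tokens, dim=-1)
    attn = torch.softmax(q @ k.T / temperature, dim=-1)  # (Nq, Ns) matching weights
    return attn @ support_label_tokens                   # (Nq, D) predicted label tokens

# Toy usage: 196 query tokens, 3 support images of 196 tokens each, dim 256
Nq, Ns, D = 196, 3 * 196, 256
pred = token_matching(torch.randn(Nq, D), torch.randn(Ns, D), torch.randn(Ns, D))
print(pred.shape)  # torch.Size([196, 256])
```

Because the prediction is a weighted combination of support labels rather than a task-specific head, the same mechanism can, in principle, decode any dense output structure that the support set demonstrates, which is what makes the few-shot adaptation possible.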
Source: Donggyun Kim et al., arxiv.org, 04-30-2024. https://arxiv.org/pdf/2404.18459.pdf