Key Concepts
The authors propose a multi-teacher distillation framework called AM-RADIO that can effectively merge distinct visual foundation models (VFMs) like CLIP, DINOv2, and SAM into a single unified model. This integrated model outperforms the individual teacher models on a wide range of computer vision benchmarks.
Summary
The paper introduces AM-RADIO, a multi-teacher distillation framework that can efficiently train new vision foundation models by unifying the unique attributes of different teacher models.
Key highlights:
AM-RADIO can distill knowledge from multiple VFMs with diverse objectives and characteristics, such as CLIP, DINOv2, and SAM, into a single student model.
The student model not only surpasses the performance of the individual teacher models on representative benchmarks, but also inherits their distinctive capabilities, such as zero-shot vision-language understanding, detailed pixel-level understanding, and open-vocabulary segmentation.
The authors also propose a novel efficient architecture called E-RADIO that exceeds the performance of its predecessors and is at least 6x faster than the teacher models at matched resolution.
Comprehensive experiments are conducted on tasks including ImageNet classification, semantic segmentation, object detection, and integration into large language models.
The authors demonstrate that by leveraging multi-teacher distillation, the student model can effectively amalgamate the unique strengths of the individual teacher models, leading to a more powerful and versatile vision foundation model.
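The amalgamation idea above can be sketched as a sum of per-teacher matching losses, where a shared student embedding is adapted to each teacher's feature space through a lightweight per-teacher head. This is a minimal illustrative sketch, not the paper's exact recipe: the teacher names and dimensions are placeholders, and a plain cosine loss on summary embeddings stands in for the full AM-RADIO objective (which also matches dense spatial features).

```python
import numpy as np

def cosine_loss(a, b):
    # 1 - cosine similarity, averaged over the batch
    a = a / np.linalg.norm(a, axis=-1, keepdims=True)
    b = b / np.linalg.norm(b, axis=-1, keepdims=True)
    return float(np.mean(1.0 - np.sum(a * b, axis=-1)))

def multi_teacher_loss(student_feats, teacher_feats, heads):
    """Sum of per-teacher distillation losses (illustrative sketch).

    student_feats: (batch, d_student) student summary embedding
    teacher_feats: dict name -> (batch, d_teacher) teacher embeddings
    heads: dict name -> (d_student, d_teacher) per-teacher adaptor matrix
    """
    total = 0.0
    for name, target in teacher_feats.items():
        # Adapt the shared student embedding into this teacher's space,
        # then penalize the angular mismatch with the teacher's output.
        projected = student_feats @ heads[name]
        total += cosine_loss(projected, target)
    return total

# Toy dimensions chosen for illustration only.
rng = np.random.default_rng(0)
student = rng.standard_normal((4, 256))
teachers = {
    "clip": rng.standard_normal((4, 512)),
    "dinov2": rng.standard_normal((4, 768)),
    "sam": rng.standard_normal((4, 256)),
}
heads = {k: rng.standard_normal((256, v.shape[1])) * 0.02
         for k, v in teachers.items()}
loss = multi_teacher_loss(student, teachers, heads)
print(loss)
```

In training, the adaptor heads and the student backbone would be optimized jointly so that one embedding satisfies all teachers at once; each per-teacher term lies in [0, 2], so the total here is bounded by 6.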
Statistics
The proposed E-RADIO model is at least 6x faster than the teacher models at matched resolution.
The RADIO-ViT-H/16 model outperforms the teacher models on 6 out of 9 benchmarks.
Quotes
"We introduce AM-RADIO with the goal of learning from multiple foundational models simultaneously."
"We observe that, when given a student model of sufficient capacity, it is often able to exceed any of its teachers on important axes."