Agglomerative Vision Foundation Model: Unifying Diverse Visual Representations into a Single Powerful Model
The authors propose AM-RADIO, a multi-teacher distillation framework that merges distinct visual foundation models (VFMs) such as CLIP, DINOv2, and SAM into a single unified student model. According to the authors, this unified model outperforms its individual teacher models on a wide range of computer vision benchmarks.
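To make the idea of multi-teacher distillation concrete, here is a minimal sketch of a feature-matching objective in plain Python. It is an illustration under stated assumptions, not the authors' implementation: the summary above does not specify the loss, so cosine distance is assumed here, and the per-teacher projection heads, the feature dimensions, the equal loss weights, and every function name are hypothetical.

```python
import math
import random


def cosine_distance(u, v):
    """Return 1 - cosine similarity of two equal-length vectors."""
    dot = sum(a * b for a, b in zip(u, v))
    norm_u = math.sqrt(sum(a * a for a in u))
    norm_v = math.sqrt(sum(b * b for b in v))
    return 1.0 - dot / (norm_u * norm_v)


def linear(weights, x):
    """Apply a weight matrix (a list of rows) to a vector."""
    return [sum(w * xi for w, xi in zip(row, x)) for row in weights]


def multi_teacher_loss(student_feat, heads, teacher_feats, teacher_weights):
    """Weighted sum of per-teacher feature-matching losses.

    heads[name] is a hypothetical projection that maps the shared student
    feature into the feature space of the teacher called `name`.
    """
    total = 0.0
    for name, target in teacher_feats.items():
        pred = linear(heads[name], student_feat)
        total += teacher_weights[name] * cosine_distance(pred, target)
    return total


# Toy setup: one shared student feature and three teachers whose feature
# dimensions differ. All sizes and values are made up for illustration.
rng = random.Random(0)
student_dim = 4
teacher_dims = {"clip": 3, "dinov2": 5, "sam": 2}

student_feat = [rng.gauss(0, 1) for _ in range(student_dim)]
heads = {
    name: [[rng.gauss(0, 1) for _ in range(student_dim)] for _ in range(dim)]
    for name, dim in teacher_dims.items()
}
teacher_feats = {
    name: [rng.gauss(0, 1) for _ in range(dim)]
    for name, dim in teacher_dims.items()
}
teacher_weights = {name: 1.0 for name in teacher_dims}  # assumed equal weighting

loss = multi_teacher_loss(student_feat, heads, teacher_feats, teacher_weights)
print(f"combined distillation loss: {loss:.4f}")
```

In this sketch one student feature serves all teachers, and only the small heads are teacher-specific. That is one common way to let a single model absorb teachers whose feature spaces differ in size. In a real training loop the student and heads would be updated by gradient descent on this loss.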