This article discusses the challenges of deploying Vision-Language models on edge devices and introduces EdgeVL as a solution. It highlights how large models can be adapted to diverse visual modalities without manual annotations, yielding significant accuracy improvements and efficiency gains.
Recent advances in Vision-Language (VL) models have sparked interest in deploying them on edge devices, yet challenges remain in handling diverse visual modalities, avoiding manual annotation, and meeting computational constraints. EdgeVL bridges this gap by combining dual-modality knowledge distillation with quantization-aware contrastive learning. This approach adapts large VL models for efficient use with both RGB and non-RGB images on resource-limited devices, without the need for manual annotations.
In recent years, there has been a surge of interest in Vision-Language (VL) models capable of integrated reasoning across visual and textual data. Prominent large-scale VL models typically employ separate visual and text encoders, enabling direct comparison between the two modalities.
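The dual-encoder comparison described above can be sketched as follows: image and text embeddings are L2-normalized, and their dot products give cosine similarities, so classification reduces to picking the class prompt whose embedding is closest to the image embedding. The embedding values below are hypothetical placeholders, not outputs of any real encoder.

```python
import numpy as np

def cosine_sim(a, b):
    # Normalize embeddings so the dot product equals cosine similarity.
    a = a / np.linalg.norm(a, axis=-1, keepdims=True)
    b = b / np.linalg.norm(b, axis=-1, keepdims=True)
    return a @ b.T

# Hypothetical pre-computed embeddings: one image, three class prompts.
image_emb = np.array([[0.9, 0.1, 0.2]])
text_embs = np.array([
    [0.8, 0.2, 0.3],   # e.g. "a photo of a chair"
    [0.1, 0.9, 0.1],   # e.g. "a photo of a table"
    [0.0, 0.1, 0.9],   # e.g. "a photo of a sofa"
])

scores = cosine_sim(image_emb, text_embs)   # shape (1, 3)
predicted = int(np.argmax(scores))          # index of the best-matching prompt
```

Because the class set is defined only by the text prompts, new classes can be added at inference time by embedding new prompts, which is what makes the classification "open-vocabulary".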
An edge device often comes equipped with multiple sensors beyond standard RGB cameras, such as depth sensors and infrared cameras. Despite this, most large VL models are tailored to RGB images, leaving adaptability to alternative inputs largely unexplored.
To overcome these challenges, a framework is needed that can transfer the VL embedding capabilities of large models to non-RGB images without relying on human annotations, while shrinking the model's computational footprint to fit edge-device capabilities.
The proposed EdgeVL framework integrates dual-modality knowledge distillation with quantization-aware training to optimize model efficiency for open-vocabulary classification tasks across various visual modalities on resource-constrained edge devices.
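A minimal sketch of the two ingredients named above, under stated assumptions: the student encoder embeds both an RGB image and its paired non-RGB (e.g. depth) view, and both embeddings are pulled toward the teacher's RGB embedding (distillation needs no labels), while a fake-quantization step simulates low-bit inference during training. The loss form and the 8-bit symmetric scheme are illustrative choices, not the paper's exact formulation.

```python
import numpy as np

def fake_quantize(x, num_bits=8):
    # Quantization-aware training trick: round activations to the nearest
    # int level, then de-quantize back to float so gradients can still flow.
    qmax = 2 ** (num_bits - 1) - 1
    scale = np.max(np.abs(x)) / qmax
    return np.round(x / scale) * scale

def dual_modality_distill_loss(student_rgb, student_depth, teacher_rgb):
    # Both the RGB and the depth embedding of the student are matched to the
    # teacher's RGB embedding, so no manual annotations are required.
    return (np.mean((student_rgb - teacher_rgb) ** 2)
            + np.mean((student_depth - teacher_rgb) ** 2))

# Hypothetical embeddings for one RGB/depth image pair.
teacher = np.array([0.5, -0.2, 0.1])
s_rgb   = fake_quantize(np.array([0.48, -0.18, 0.12]))
s_depth = fake_quantize(np.array([0.40, -0.25, 0.05]))
loss = dual_modality_distill_loss(s_rgb, s_depth, teacher)
```

Training against quantized embeddings means the student already "sees" quantization error, so accuracy degrades far less when the model is later deployed in low-bit form on the edge device.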
EdgeVL represents a systematic effort to adapt large VL models for edge deployment, demonstrating accuracy improvements on multiple datasets and up to a 93-fold reduction in model size.
Key Insights Distilled From
by Kaiwen Cai, Z... at arxiv.org, 03-11-2024
https://arxiv.org/pdf/2403.04908.pdf