Core Concepts
The authors propose an efficient and effective method to fine-tune a pre-trained vision foundation model (VFM) into a mixed-precision quantized supernet using a low-rank adapter (LoRA). The method addresses two challenges: optimizing the combined mixed-precision search space and reducing the large memory cost of supernet training.
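As a rough, minimal sketch of this idea (assuming a PyTorch-style layer; the name QuantLoRALinear and the symmetric uniform quantizer are illustrative assumptions, not the authors' implementation): the pre-trained weight is frozen, only a low-rank update is trained, and the merged weight is fake-quantized to whatever bit-width the sampled subnet uses.

    import torch
    import torch.nn as nn

    def fake_quant(w: torch.Tensor, bits: int) -> torch.Tensor:
        # Symmetric uniform fake-quantization with a straight-through
        # estimator: forward uses the quantized weight, backward the real one.
        qmax = 2 ** (bits - 1) - 1
        scale = w.abs().max() / qmax
        q = torch.clamp(torch.round(w / scale), -qmax, qmax) * scale
        return w + (q - w).detach()

    class QuantLoRALinear(nn.Module):
        """Frozen pre-trained weight plus a trainable low-rank adapter,
        quantized to a selectable bit-width (hypothetical layer)."""
        def __init__(self, weight: torch.Tensor, rank: int = 8):
            super().__init__()
            out_f, in_f = weight.shape
            self.weight = nn.Parameter(weight, requires_grad=False)  # frozen
            self.lora_a = nn.Parameter(torch.randn(rank, in_f) * 0.01)
            self.lora_b = nn.Parameter(torch.zeros(out_f, rank))  # zero init

        def forward(self, x: torch.Tensor, bits: int) -> torch.Tensor:
            # Merge the low-rank update, then quantize at the sampled bit-width.
            w = self.weight + self.lora_b @ self.lora_a
            return x @ fake_quant(w, bits).t()

Because only lora_a and lora_b receive gradients, optimizer state is kept only for the low-rank factors, which is where the memory savings of LoRA-style fine-tuning come from.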
Abstract
The authors target the deployment of large VFMs on diverse hardware by fine-tuning a pre-trained VFM, specifically the Segment Anything Model (SAM), into a mixed-precision quantized supernet.
Key highlights:
The authors analyze effective search space design for fine-tuning a VFM, comparing operators such as input resolution, feature size, width, depth, and bit-width in terms of performance and reduction in bit-wise operations (BitOPs); a toy encoding of such a search space is sketched after this list.
The authors propose a memory-efficient supernet training method that combines a low-rank adapter (LoRA) with a progressive training strategy to curb the large memory cost of training.
Two LoRA-based architectures, selective and multiplex, are introduced to improve representation capacity and to avoid gradient conflicts during supernet training; a sketch of the selective idea follows this list.
The proposed method is evaluated on semantic and instance segmentation tasks, where it outperforms state-of-the-art mixed-precision supernet methods and achieves about a 95% reduction in BitOPs without incurring performance degradation.
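To make the search dimensions above concrete, here is a toy encoding of such a mixed search space; all candidate values are hypothetical, not the paper's actual ranges.

    import random

    # Hypothetical candidates for each searchable dimension.
    search_space = {
        "resolution": [512, 768, 1024],  # input image resolution
        "feature_size": [32, 48, 64],    # spatial size of encoder features
        "width": [0.5, 0.75, 1.0],       # channel width multiplier
        "depth": [12, 18, 24],           # number of transformer blocks
        "bits": [3, 4, 6, 8],            # weight/activation bit-widths
    }

    def sample_subnet(space):
        """Uniformly sample one subnet configuration, as supernet
        training typically does at each step."""
        return {k: random.choice(v) for k, v in space.items()}

    print(sample_subnet(search_space))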
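The sketch below shows one plausible reading of the selective design (the class name SelectiveLoRA, the branch layout, and the hyperparameters are all assumptions, not the authors' code): the quantized base weight is frozen and shared, and each bit-width candidate gets its own low-rank branch, so different candidates never push gradients into the same adapter.

    import torch
    import torch.nn as nn

    def fake_quant(w: torch.Tensor, bits: int) -> torch.Tensor:
        # Same symmetric uniform quantizer as the earlier sketch.
        qmax = 2 ** (bits - 1) - 1
        scale = w.abs().max() / qmax
        q = torch.clamp(torch.round(w / scale), -qmax, qmax) * scale
        return w + (q - w).detach()

    class SelectiveLoRA(nn.Module):
        """Frozen quantized base weight shared by all subnets, plus one
        low-rank branch per bit-width candidate (illustrative design)."""
        def __init__(self, weight: torch.Tensor, bit_choices=(3, 4, 6, 8),
                     rank: int = 8):
            super().__init__()
            out_f, in_f = weight.shape
            self.weight = nn.Parameter(weight, requires_grad=False)  # frozen
            self.adapters = nn.ModuleDict({
                str(b): nn.Sequential(
                    nn.Linear(in_f, rank, bias=False),   # LoRA "A"
                    nn.Linear(rank, out_f, bias=False),  # LoRA "B"
                )
                for b in bit_choices
            })
            for m in self.adapters.values():
                nn.init.zeros_(m[1].weight)  # LoRA convention: start at zero

        def forward(self, x: torch.Tensor, bits: int) -> torch.Tensor:
            # Only the branch matching the sampled bit-width runs and receives
            # gradients, so candidates do not fight over shared adapter weights.
            base = x @ fake_quant(self.weight, bits).t()
            return base + self.adapters[str(bits)](x)

(The multiplex variant presumably mixes shared and per-candidate low-rank branches; the paper's exact wiring may differ.) Since the base weight stores no gradients or optimizer state, per-step training memory stays close to inference cost.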
Stats
The authors report the following key metrics:
SAM's ViT-L image encoder requires approximately 2900 GFLOPs.
Training the supernet over the mixed-precision search space for SAM requires GPUs with at least 48 GB of memory.
The proposed method reduces the memory cost by approximately 18% compared to the state-of-the-art method QFA*.
The searched model yields about a 95% reduction in BitOPs without incurring performance degradation (see the arithmetic sketch below).
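For context on where a ~95% figure can come from: BitOPs are commonly counted as MACs × weight bits × activation bits, so reducing a 32-bit model to roughly 7-bit average precision already accounts for about 95%. The arithmetic below is purely illustrative and assumes this common definition, not the paper's exact accounting.

    # Illustrative only: BitOPs ~= MACs * weight_bits * activation_bits.
    flops = 2900e9            # ~2900 GFLOPs for SAM's ViT-L image encoder
    macs = flops / 2          # one multiply-accumulate = two FLOPs

    fp32_bitops = macs * 32 * 32    # full-precision baseline
    mixed_bitops = macs * 7 * 7     # e.g. ~7-bit average weights/activations

    print(f"reduction: {1 - mixed_bitops / fp32_bitops:.1%}")  # reduction: 95.2%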
Quotes
"Compression of large and performant vision foundation models (VFMs) into arbitrary bit-wise operations (BitOPs) allows their deployment on various hardware."
"We propose to fine-tune a VFM to a mixed-precision quantized supernet. The supernet-based neural architecture search (NAS) can be adopted for this purpose, which trains a supernet, and then subnets within arbitrary hardware budgets can be extracted."
"We focus on combining two complex search spaces of mixed-precision quantization and supernet-based NAS to reduce the BitOPs of VFMs."