Core Concepts
An effective and efficient post-training quantization framework, PTQ4RIS, is proposed to enable the deployment of large multi-modal referring image segmentation models on resource-constrained edge devices.
Summary
The paper presents an effective and efficient post-training quantization (PTQ) framework, termed PTQ4RIS, for Referring Image Segmentation (RIS) models. RIS aims to segment the object referred to by a given sentence in an image by understanding both visual and linguistic information. However, existing RIS methods often focus on achieving top performance, overlooking practical considerations for deployment on resource-limited edge devices.
The authors first conduct an in-depth analysis of the root causes of performance degradation when quantizing RIS models. They identify two key challenges: 1) the non-normal (bimodal) distribution of post-Softmax and post-GeLU activations in the visual encoder, and 2) significant outliers in the text encoder activations. Both violate the assumption behind a single uniform quantizer: its evenly spaced levels are wasted when values cluster in narrow modes or when a few extreme values stretch the quantization range.
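To make this kind of analysis concrete, the sketch below shows one generic way to inspect such activation distributions with PyTorch forward hooks. It is illustrative only, not code from the paper; it assumes a model whose Softmax/GELU layers are exposed as nn.Module instances (many implementations call the functional forms instead, in which case the hooks find nothing).

```python
import torch
import torch.nn as nn

def collect_activation_stats(model: nn.Module, example_inputs: tuple):
    """Run one forward pass and record per-module activation statistics."""
    stats, handles = {}, []

    def make_hook(name):
        def hook(_module, _inputs, output):
            flat = output.detach().flatten().float()
            if flat.numel() > 1_000_000:  # subsample so torch.quantile stays cheap
                idx = torch.randint(flat.numel(), (1_000_000,), device=flat.device)
                flat = flat[idx]
            stats[name] = {
                "min": flat.min().item(),
                "max": flat.max().item(),
                "mean": flat.mean().item(),
                # A max far above the 99.9th percentile hints at outliers.
                "p999": flat.quantile(0.999).item(),
            }
        return hook

    for name, module in model.named_modules():
        if isinstance(module, (nn.Softmax, nn.GELU)):
            handles.append(module.register_forward_hook(make_hook(name)))

    with torch.no_grad():
        model(*example_inputs)
    for h in handles:
        h.remove()
    return stats
```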
To address these issues, the authors propose two novel quantization techniques (both sketched in the code after this list):
- Dual-Region Quantization (DRQ) for the visual encoder, which quantizes the activation values in two separate regions so that each mode of the bimodal distribution gets its own well-fitted quantization grid.
- Reorder-based Outlier-Retained Quantization (RORQ) for the text encoder, which iteratively partitions the activation values into groups and dynamically quantizes each group with a distinct scale factor, so that a few outliers no longer stretch the quantization range for the rest.
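The following PyTorch sketches illustrate the two mechanisms described above. They are minimal, generic re-creations in the spirit of DRQ and RORQ, not the authors' implementations: the median split point, the group count, and the equal-size grouping are assumptions made purely for illustration.

```python
import torch

def _uniform_fake_quant(v: torch.Tensor, levels: int) -> torch.Tensor:
    """Min-max uniform fake-quantization of one group of values (helper)."""
    if v.numel() == 0:
        return v
    lo, hi = v.min(), v.max()
    scale = (hi - lo).clamp(min=1e-8) / levels
    return torch.round((v - lo) / scale) * scale + lo

def dual_region_quantize(x: torch.Tensor, n_bits: int = 8, split=None) -> torch.Tensor:
    """Two-region quantizer in the spirit of DRQ (illustrative, not the paper's exact rule).

    Post-Softmax values crowd near zero with a second mode near one, so one
    grid wastes most of its levels; here each region gets its own grid.
    """
    if split is None:
        split = x.median().item()       # crude split point, assumed heuristic
    levels = 2 ** (n_bits - 1) - 1      # split the code budget between regions
    out = torch.empty_like(x)
    lo_mask = x < split
    out[lo_mask] = _uniform_fake_quant(x[lo_mask], levels)
    out[~lo_mask] = _uniform_fake_quant(x[~lo_mask], levels)
    return out

def rorq_like_quantize(x: torch.Tensor, n_bits: int = 8, n_groups: int = 4) -> torch.Tensor:
    """Group-wise quantizer in the spirit of RORQ (illustrative).

    Values are reordered by magnitude and partitioned into groups, each with
    its own scale factor, so a few outliers no longer stretch the grid used
    for the bulk of the distribution.
    """
    flat = x.flatten()
    order = flat.abs().argsort()        # reorder: outliers land in the last group
    out = torch.empty_like(flat)
    levels = 2 ** n_bits - 1
    for idx in order.chunk(n_groups):   # contiguous groups after reordering
        out[idx] = _uniform_fake_quant(flat[idx], levels)
    return out.view_as(x)
```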
For the quantization-friendly feature fusion and decoder modules, the authors apply a simple uniform quantization approach.
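For reference, below is a minimal sketch of the standard asymmetric min-max uniform (fake-)quantizer that such quantization-friendly modules can use; the paper's exact calibration details may differ.

```python
import torch

def uniform_quantize(x: torch.Tensor, n_bits: int = 8) -> torch.Tensor:
    """Standard asymmetric min-max uniform quantization, simulated in float."""
    qmax = 2 ** n_bits - 1
    lo, hi = x.min(), x.max()
    scale = (hi - lo).clamp(min=1e-8) / qmax
    zero_point = torch.round(-lo / scale)
    q = torch.clamp(torch.round(x / scale) + zero_point, 0, qmax)
    return (q - zero_point) * scale     # dequantize back for simulation
```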
Extensive experiments on three RIS benchmark datasets with different bit-width settings (from 8 to 4 bits) demonstrate the superior performance of the proposed PTQ4RIS framework. Notably, the PTQ INT8 model's accuracy is almost on par with the full-precision (FP32) model on some datasets, and the performance degradation is minimal even in the W6A6 and W4A8 settings.
Statistics
This summary does not reproduce specific numbers; the paper reports its key results as quantitative performance metrics, such as mIoU and oIoU, across different bit-width settings and benchmark datasets.
Quotes
"Referring Image Segmentation (RIS), aims to segment the object referred by a given sentence in an image by understanding both visual and linguistic information."
"Existing RIS methods tend to explore top-performance models, disregarding considerations for practical applications on resources-limited edge devices."
"We unveil the root causes of performance collapse in the quantization of the RIS model, revealing that challenges primarily arise from the unique activation distributions of post-Softmax and post-GeLU in visual encoder, along with activation outliers in text encoder."