The paper presents an effective and efficient post-training quantization (PTQ) framework, termed PTQ4RIS, for Referring Image Segmentation (RIS) models. RIS aims to segment the object referred to by a given sentence in an image by understanding both visual and linguistic information. However, existing RIS methods often focus on achieving top performance, overlooking practical considerations for deployment on resource-limited edge devices.
The authors first conduct an in-depth analysis of the root causes of performance degradation in RIS model quantization. They identify two key challenges: 1) the non-normal distribution of post-Softmax and post-GeLU activations in the visual encoder, and 2) the presence of significant outliers in the text encoder activations.
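To make the first challenge concrete, the following is a minimal, self-contained sketch (not taken from the paper) that quantizes simulated post-Softmax attention weights with a plain uniform quantizer and with a log-domain quantizer of the kind common in the ViT post-training-quantization literature. It only illustrates why the skewed, non-normal distribution is problematic for a uniform grid; all function names are illustrative and the log-domain scheme is not necessarily the paper's technique.

```python
import numpy as np

rng = np.random.default_rng(0)

def uniform_quant(x, n_bits):
    # Plain asymmetric uniform quantization over the tensor's [min, max] range.
    qmax = 2 ** n_bits - 1
    scale = (x.max() - x.min()) / qmax
    zero_point = np.round(-x.min() / scale)
    q = np.clip(np.round(x / scale) + zero_point, 0, qmax)
    return (q - zero_point) * scale

def log2_quant(x, n_bits, eps=1e-12):
    # Log-domain quantization, a common choice for post-Softmax values in the
    # ViT PTQ literature (illustrative here, not the paper's exact method).
    qmax = 2 ** n_bits - 1
    q = np.clip(np.round(-np.log2(np.maximum(x, eps))), 0, qmax)
    return 2.0 ** (-q)

# Simulated post-Softmax attention weights: values pile up near zero,
# so a uniform grid spends most of its levels where almost no mass lies.
logits = rng.normal(size=(1000, 64))
attn = np.exp(logits) / np.exp(logits).sum(axis=-1, keepdims=True)

for name, fn in [("uniform", uniform_quant), ("log2", log2_quant)]:
    x_hat = fn(attn, 4)
    rel_err = np.mean(np.abs(attn - x_hat) / attn)
    print(f"4-bit {name:7s} mean relative error: {rel_err:.3f}")
```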
To address these issues, the authors propose two dedicated quantization techniques: one tailored to the skewed, non-normal post-Softmax and post-GeLU activation distributions in the visual encoder, and one designed to handle the significant outliers in the text encoder activations.
For the quantization-friendly feature fusion and decoder modules, the authors apply a simple uniform quantization approach.
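For context, a standard min-max (asymmetric) uniform quantizer of the kind applied to these quantization-friendly modules can be sketched as below. The calibrate/quantize/dequantize split and all names are assumptions for illustration, not the paper's API.

```python
import numpy as np

def calibrate(x, n_bits=8):
    # Derive scale and zero-point from a calibration tensor's value range.
    qmax = 2 ** n_bits - 1
    x_min, x_max = float(x.min()), float(x.max())
    scale = max(x_max - x_min, 1e-8) / qmax
    zero_point = int(round(-x_min / scale))
    return scale, zero_point

def quantize(x, scale, zero_point, n_bits=8):
    # Map float values onto the integer grid [0, 2**n_bits - 1].
    qmax = 2 ** n_bits - 1
    return np.clip(np.round(x / scale) + zero_point, 0, qmax).astype(np.int32)

def dequantize(q, scale, zero_point):
    # Map integers back to approximate float values.
    return (q - zero_point) * scale

# Example: simulate an 8-bit activation round trip on a random tensor.
x = np.random.randn(4, 256).astype(np.float32)
s, z = calibrate(x, n_bits=8)
x_hat = dequantize(quantize(x, s, z), s, z)
print("max abs round-trip error:", np.abs(x - x_hat).max())
```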
Extensive experiments on three RIS benchmark datasets with different bit-width settings (from 8 to 4 bits) demonstrate the superior performance of the proposed PTQ4RIS framework. Notably, the PTQ INT8 model's accuracy is almost on par with the full-precision (FP32) model on some datasets, and the performance degradation is minimal even in the W6A6 and W4A8 settings.
Source: Xiaoyan Jian... et al., arXiv preprint, 2024-09-26, https://arxiv.org/pdf/2409.17020.pdf