The paper presents an effective and efficient post-training quantization (PTQ) framework, termed PTQ4RIS, for Referring Image Segmentation (RIS) models. RIS aims to segment the object referred to by a given sentence in an image by understanding both visual and linguistic information. However, existing RIS methods often focus on achieving top performance, overlooking practical considerations for deployment on resource-limited edge devices.
The authors first conduct an in-depth analysis of the root causes of performance degradation when quantizing RIS models. They identify two key challenges: 1) the non-normal distributions of post-Softmax and post-GeLU activations in the visual encoder, and 2) significant outliers in the text encoder activations.
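To make the first challenge concrete, the following is a minimal numerical sketch (not the authors' code) showing why post-Softmax activations are poorly served by plain uniform quantization: the values are heavily concentrated near zero with a sparse tail toward one, so most uniform levels go unused. The synthetic logits and the min-max quantizer here are illustrative assumptions.

```python
import torch

# Illustrative only: synthetic, fairly peaked attention logits stand in for
# real post-Softmax activations from a visual encoder.
torch.manual_seed(0)
logits = 4.0 * torch.randn(64, 128)
attn = torch.softmax(logits, dim=-1)          # values in (0, 1), clustered near zero

def uniform_quantize(x: torch.Tensor, n_bits: int = 4) -> torch.Tensor:
    """Plain min-max uniform (fake) quantization to n_bits, unsigned."""
    qmax = 2 ** n_bits - 1
    scale = (x.max() - x.min()) / qmax
    q = torch.round((x - x.min()) / scale).clamp(0, qmax)
    return q * scale + x.min()

dequant = uniform_quantize(attn, n_bits=4)
scale = (attn.max() - attn.min()) / 15
in_lowest_bin = torch.round((attn - attn.min()) / scale) == 0
print(f"values collapsed into the lowest bin: {in_lowest_bin.float().mean().item():.1%}")
print(f"mean relative error: {((attn - dequant).abs() / attn).mean().item():.2f}")
```

Running this, nearly all activations fall into the single lowest quantization bin, which is the kind of distribution mismatch the paper's visual-encoder analysis points to.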
To address these issues, the authors propose two dedicated quantization techniques: one tailored to the non-normal post-Softmax and post-GeLU activation distributions in the visual encoder, and another designed to handle the significant outliers in the text encoder activations.
For the quantization-friendly feature fusion and decoder modules, the authors apply a simple uniform quantization approach.
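For the quantization-friendly modules, "simple uniform quantization" typically means an affine (scale and zero-point) scheme. Below is a hedged sketch of what that could look like for a stand-in fusion/decoder linear layer, assuming per-tensor asymmetric fake quantization of activations and per-channel symmetric fake quantization of weights; the function names and calibration choices are illustrative, not taken from the paper.

```python
import torch
import torch.nn as nn

def fake_quant_activation(x: torch.Tensor, n_bits: int = 8) -> torch.Tensor:
    """Per-tensor asymmetric uniform fake quantization of activations."""
    qmax = 2 ** n_bits - 1
    scale = (x.max() - x.min()).clamp(min=1e-8) / qmax
    zero_point = torch.round(-x.min() / scale)
    q = torch.clamp(torch.round(x / scale) + zero_point, 0, qmax)
    return (q - zero_point) * scale

def fake_quant_weight(w: torch.Tensor, n_bits: int = 8) -> torch.Tensor:
    """Per-output-channel symmetric uniform fake quantization of weights."""
    qmax = 2 ** (n_bits - 1) - 1
    scale = w.abs().amax(dim=1, keepdim=True).clamp(min=1e-8) / qmax
    q = torch.clamp(torch.round(w / scale), -qmax - 1, qmax)
    return q * scale

# Usage on a hypothetical fusion/decoder layer.
layer = nn.Linear(256, 256)
x = torch.randn(4, 256)
w_q = fake_quant_weight(layer.weight.data)
x_q = fake_quant_activation(x)
y_q = torch.nn.functional.linear(x_q, w_q, layer.bias)
print("mean output error vs FP32:", (layer(x) - y_q).abs().mean().item())
```

For modules whose activations are roughly well-behaved, this kind of straightforward scheme introduces little error, which is consistent with the paper's decision to reserve the specialized techniques for the visual and text encoders.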
Extensive experiments on three RIS benchmark datasets with different bit-width settings (from 8 to 4 bits) demonstrate the superior performance of the proposed PTQ4RIS framework. Notably, the PTQ INT8 model's accuracy is almost on par with the full-precision (FP32) model on some datasets, and the performance degradation is minimal even in the W6A6 and W4A8 settings.