Core Concepts
A genetic algorithm-based quantization-aware approximation method (GQA-LUT) is proposed to efficiently handle diverse non-linear operations in Transformers using integer-only arithmetic.
Abstract
The paper presents a genetic algorithm-based quantization-aware approximation method called GQA-LUT to efficiently handle non-linear operations in Transformer models.
Key highlights:
Non-linear functions such as GELU, Softmax, and LayerNorm are prevalent in Transformers and their lightweight variants, incurring substantial hardware costs. Previous works have optimized these operations using piece-wise linear (pwl) approximation with look-up tables (LUTs), but they do not account for integer-only quantization.
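To make the pwl-LUT scheme concrete, here is a minimal Python sketch that fits per-segment slopes and intercepts for GELU over uniform breakpoints and evaluates the approximation. The segment count, fitting range, and uniform breakpoints are illustrative assumptions, not the paper's fitted parameters.

```python
import numpy as np

# Minimal pwl-LUT sketch: approximate GELU with 8 linear segments.
# Uniform breakpoints and the [-4, 4] range are illustrative choices.

def gelu(x):
    return 0.5 * x * (1.0 + np.tanh(np.sqrt(2.0 / np.pi) * (x + 0.044715 * x**3)))

def fit_pwl_lut(fn, lo=-4.0, hi=4.0, n_segments=8):
    """Store one (slope, intercept) pair per segment in the LUT."""
    bp = np.linspace(lo, hi, n_segments + 1)
    y = fn(bp)
    slopes = (y[1:] - y[:-1]) / (bp[1:] - bp[:-1])
    intercepts = y[:-1] - slopes * bp[:-1]
    return bp, slopes, intercepts

def pwl_eval(x, bp, slopes, intercepts):
    """Pick the segment containing x, then apply k * x + b."""
    idx = np.clip(np.searchsorted(bp, x) - 1, 0, len(slopes) - 1)
    return slopes[idx] * x + intercepts[idx]

bp, k, b = fit_pwl_lut(gelu)
x = np.linspace(-4.0, 4.0, 101)
print(np.max(np.abs(pwl_eval(x, bp, k, b) - gelu(x))))  # small approximation error
```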
The authors analyze the interplay between the quantization scaling factor and the LUT parameters, and formulate a general quantization-aware, LUT-based approximation flow.
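A hedged sketch of what such a flow can look like: with inputs quantized as x = s_x * q_x and outputs as y = s_y * q_y, the scaling factors fold into the stored slopes and intercepts, so segment selection and the affine evaluation run entirely on integer codes. The names, the fixed-point width, and the toy |x| LUT below are assumptions for illustration, not the paper's formulation.

```python
import numpy as np

# Quantization-aware evaluation sketch: fold s_x and s_y into the LUT
# parameters so the per-segment affine runs on integer codes only.
# The frac_bits fixed-point width and the toy |x| LUT are illustrative.

def quantize(x, scale, bits=8):
    qmax = 2 ** (bits - 1) - 1
    return np.clip(np.round(x / scale), -qmax - 1, qmax).astype(np.int32)

def int_pwl_eval(q_x, s_x, s_y, bp, slopes, intercepts, frac_bits=8):
    q_bp = np.round(bp / s_x).astype(np.int32)            # breakpoints on the input grid
    q_k = np.round(slopes * s_x / s_y * 2**frac_bits).astype(np.int32)
    q_b = np.round(intercepts / s_y * 2**frac_bits).astype(np.int32)
    idx = np.clip(np.searchsorted(q_bp, q_x) - 1, 0, len(q_k) - 1)
    return (q_k[idx] * q_x + q_b[idx]) >> frac_bits       # integer multiply-add + shift

# Toy 2-segment LUT for y = |x|, just to exercise the integer path:
bp = np.array([-4.0, 0.0])
slopes = np.array([-1.0, 1.0])
intercepts = np.array([0.0, 0.0])
s_x = s_y = 0.05
q_y = int_pwl_eval(quantize(np.array([-1.0, 2.0]), s_x), s_x, s_y, bp, slopes, intercepts)
print(q_y * s_y)  # approximately [1.0, 2.0]
```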
The proposed GQA-LUT algorithm automatically determines the optimal LUT parameters with quantization awareness, overcoming the limitations of existing methods that fail to adjust parameters based on scaling factors.
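As a rough illustration of the search (not the paper's exact operators or hyper-parameters), a genetic algorithm can evolve the interior breakpoints, scoring each candidate by the approximation error of the pwl fit it induces:

```python
import numpy as np

# Toy genetic algorithm over interior breakpoints. Population size,
# mutation scale, selection rule, and the GELU target are illustrative
# choices; GQA-LUT's actual operators and fitness differ.

rng = np.random.default_rng(0)

def gelu(x):
    return 0.5 * x * (1.0 + np.tanh(np.sqrt(2.0 / np.pi) * (x + 0.044715 * x**3)))

def fitness(inner_bp, fn=gelu, lo=-4.0, hi=4.0):
    bp = np.concatenate(([lo], np.sort(inner_bp), [hi]))
    if np.any(np.diff(bp) <= 0):        # degenerate segment: reject candidate
        return np.inf
    y = fn(bp)
    ks = (y[1:] - y[:-1]) / (bp[1:] - bp[:-1])
    bs = y[:-1] - ks * bp[:-1]
    xs = np.linspace(lo, hi, 1024)
    idx = np.clip(np.searchsorted(bp, xs) - 1, 0, len(ks) - 1)
    return np.mean(np.abs(ks[idx] * xs + bs[idx] - fn(xs)))  # mean abs error

def evolve(n_inner=7, pop=32, gens=100):
    population = rng.uniform(-3.9, 3.9, size=(pop, n_inner))
    for _ in range(gens):
        scores = np.array([fitness(ind) for ind in population])
        parents = population[np.argsort(scores)[: pop // 2]]      # keep the fittest half
        children = parents + rng.normal(0.0, 0.1, parents.shape)  # Gaussian mutation
        population = np.concatenate([parents, np.clip(children, -3.9, 3.9)])
    return population[np.argmin([fitness(ind) for ind in population])]

best_bp = evolve()
print(np.sort(best_bp), fitness(best_bp))
```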
A rounding mutation algorithm is introduced to handle the breakpoint deviation issue caused by large scaling factors, further improving the accuracy of GQA-LUT.
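The deviation issue can be seen numerically: with a scaling factor of, say, 0.25, a fitted breakpoint of 1.37 snaps to 1.25 on the integer grid. A rounding-style mutation keeps candidates on that grid so the genetic algorithm scores exactly what the integer hardware will execute; the operator below is a simplified stand-in, not the paper's rounding mutation.

```python
import numpy as np

# Breakpoint deviation under a large scaling factor, and a simplified
# rounding-style mutation that keeps candidates on the integer grid.

rng = np.random.default_rng(0)

s_x = 0.25                                  # a deliberately coarse scaling factor
bp_real = 1.37                              # fitted real-valued breakpoint
bp_snapped = np.round(bp_real / s_x) * s_x
print(bp_real, "->", bp_snapped)            # 1.37 -> 1.25: a deviation of 0.12

def rounding_mutation(inner_bp, s_x, prob=0.3):
    """Mutate each breakpoint by a whole grid step of size s_x (or leave it)."""
    steps = rng.integers(-1, 2, size=inner_bp.shape)  # -1, 0, or +1 steps
    mask = rng.random(inner_bp.shape) < prob          # mutate ~30% of genes
    grid = np.round(inner_bp / s_x) + steps * mask
    return grid * s_x                                 # result stays on the grid

print(rounding_mutation(np.array([1.37, -0.6, 2.2]), s_x))
```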
Experiments show that the INT8-based GQA-LUT achieves negligible accuracy degradation on semantic segmentation tasks compared to high-precision alternatives, while enabling significant hardware savings of 81.3-81.7% in area and 79.3-80.2% in power.
Stats
The paper reports the following key figures:
GQA-LUT with an 8-entry LUT achieves 81.3-81.7% area savings and 79.3-80.2% power reduction compared to high-precision FP/INT32 alternatives.
Expanding the LUT storage to 16 entries costs roughly 1.71x the area and 1.95x the power of the 8-entry INT8 configuration.
Quotes
"The results demonstrate that GQA-LUT achieves negligible degradation on the challenging semantic segmentation task for both vanilla and linear Transformer models."
"The area and power performance synthesized with TSMC 28-nm technology demonstrates that the INT8-based arithmetics achieve significant improvements, compared to the high-precision FP/INT32 units."