Core Concepts
A novel 1/1.58-bit spiking language model architecture that leverages knowledge distillation and equilibrium-based training to achieve significant energy and power savings while maintaining competitive performance on natural language processing tasks.
Abstract
The paper proposes a framework for training a spiking language model (LM) with parameters quantized to 1 bit or 1.58 bits. The base architecture follows previous BERT-based spiking models, with spiking encoder layers that comprise a spiking attention module and quantized linear layers.
The key aspects of the approach are:
Quantizing the weights of linear layers to 1-bit (binary) or 1.58-bit (ternary) values, while the inputs to these layers are themselves quantized to binary spikes. This extreme quantization is enabled by the spiking nature of the architecture, which spreads neuron activation precision across the temporal dimension (see the quantizer sketch after this list).
Leveraging the average spiking rate (ASR) at equilibrium of the quantized spiking LM to perform efficient knowledge distillation (KD) from a non-spiking, high-precision "teacher" model to the 1/1.58-bit spiking "student" LM. This KD technique is crucial for training such an extremely quantized spiking model (see the distillation sketch below).
Employing implicit differentiation at equilibrium to train the spiking architecture, which eliminates the need for surrogate gradients and reduces memory requirements compared to backpropagation through time (BPTT); a sketch of this training scheme also follows the list.
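To illustrate the first aspect, here is a minimal sketch of binary and ternary weight quantization in PyTorch, using the absmean scaling recipe popularized by BitNet-style 1.58-bit models together with a straight-through estimator; the function names and exact scaling rule are assumptions for illustration, not the paper's stated formulation.

```python
import torch

def quantize_ternary(w: torch.Tensor) -> torch.Tensor:
    """Map full-precision weights to {-1, 0, +1} (1.58 bits),
    scaled by the mean absolute weight (absmean recipe)."""
    scale = w.abs().mean().clamp(min=1e-5)
    return torch.clamp(torch.round(w / scale), -1, 1) * scale

def quantize_binary(w: torch.Tensor) -> torch.Tensor:
    """Map full-precision weights to {-alpha, +alpha} (1 bit)."""
    return torch.sign(w) * w.abs().mean()

def ste(w: torch.Tensor, quantize) -> torch.Tensor:
    """Straight-through estimator: the forward pass sees quantized
    weights; gradients flow to the latent full-precision weights."""
    return w + (quantize(w) - w).detach()
```

Because the inputs are binary spikes and the weights are binary or ternary, each matrix product reduces to additions and subtractions (accumulate operations) rather than multiply-accumulates.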
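For the second aspect, a hypothetical sketch of ASR-based distillation: the student's average spiking rate near equilibrium is aligned with the teacher's hidden states at a matched layer. The tensor shapes, the projection `proj`, and the MSE objective are assumptions for illustration, not the paper's exact loss.

```python
import torch
import torch.nn.functional as F

def asr_distillation_loss(spikes: torch.Tensor,
                          teacher_hidden: torch.Tensor,
                          proj: torch.nn.Module) -> torch.Tensor:
    """Align the student's average spiking rate (ASR) with the
    teacher's hidden states at a matched layer.

    spikes:         (T, B, L, D) binary spike outputs of a student
                    layer over T time steps, taken near equilibrium
    teacher_hidden: (B, L, D_teacher) full-precision teacher
                    activations at the matched layer
    proj:           learned linear map from D to D_teacher
    """
    asr = spikes.float().mean(dim=0)   # spiking rates in [0, 1]
    return F.mse_loss(proj(asr), teacher_hidden)
```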
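For the third aspect, a DEQ-style sketch of implicit differentiation at a fixed point (after deep-equilibrium-model training): the forward dynamics run without building a computation graph, and the backward pass solves a small fixed-point equation instead of unrolling time steps as BPTT would. The update function `f` and the iteration counts are placeholders; the paper's spiking dynamics are abstracted into `f`.

```python
import torch
import torch.nn as nn

class EquilibriumLayer(nn.Module):
    """Sketch of training at equilibrium with implicit differentiation.
    `f(z, x)` is one step of the layer dynamics; the state z is assumed
    to have the same shape as the input x."""

    def __init__(self, f, fwd_iters=30, bwd_iters=30):
        super().__init__()
        self.f, self.fwd_iters, self.bwd_iters = f, fwd_iters, bwd_iters

    def forward(self, x):
        # Drive the dynamics to a fixed point z* = f(z*, x) without
        # storing the trajectory (memory does not grow with time steps).
        with torch.no_grad():
            z = torch.zeros_like(x)
            for _ in range(self.fwd_iters):
                z = self.f(z, x)
        z = self.f(z, x)                 # one differentiable step at z*
        z0 = z.detach().requires_grad_()
        f0 = self.f(z0, x)

        def backward_hook(grad):
            # Implicit function theorem: solve g = grad + J_f(z*)^T g
            # by fixed-point iteration instead of unrolling in time.
            g = grad
            for _ in range(self.bwd_iters):
                g = grad + torch.autograd.grad(
                    f0, z0, g, retain_graph=True)[0]
            return g

        if z.requires_grad:
            z.register_hook(backward_hook)
        return z
```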
The proposed 1/1.58-bit spiking LM is evaluated on multiple text classification tasks from the GLUE benchmark and achieves performance close to that of the full-precision spiking model, while offering significant reductions in model size, energy, and power consumption.
Stats
The total normalized number of operations (Norm#OPS) in the 1.58-bit SpikingBERT model is approximately the same as in the full-precision SpikingBERT model on the MRPC dataset.
Each accumulate (ACC) operation in the 1/1.58-bit quantized models is at least an order of magnitude more energy-efficient than the multiply-accumulate (MAC) operations of full-precision models in 45 nm CMOS technology.
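As a rough sanity check on this claim, the arithmetic below uses commonly cited 45 nm CMOS energy figures from the broader literature (Horowitz, ISSCC 2014), not values quoted from the paper: a full-precision multiply-accumulate costs roughly 4.6 pJ, while the integer accumulate that replaces it in a spiking, binary/ternary-weight model costs roughly 0.1 pJ.

```python
# Commonly cited 45 nm CMOS energy estimates (Horowitz, ISSCC 2014);
# illustrative figures from the literature, not from the paper itself.
E_FP32_MULT = 3.7   # pJ, 32-bit floating-point multiply
E_FP32_ADD  = 0.9   # pJ, 32-bit floating-point add
E_INT32_ADD = 0.1   # pJ, 32-bit integer add

e_mac = E_FP32_MULT + E_FP32_ADD   # one full-precision multiply-accumulate
e_acc = E_INT32_ADD                # one accumulate in the spiking model

print(f"MAC ~{e_mac:.1f} pJ, ACC ~{e_acc:.1f} pJ, "
      f"ratio ~{e_mac / e_acc:.0f}x")   # ~46x, over an order of magnitude
```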
Quotes
"The combination of extreme weight quantization and spiking neuronal activity enables a remarkable reduction in the model's size, energy and power consumption."