Core Concepts
Through architectural and numerical optimizations, the authors demonstrate that Conformer-based end-to-end speech recognition models can be efficiently deployed on resource-constrained devices such as mobile phones and wearables, while preserving recognition accuracy, achieving faster-than-real-time performance, and reducing energy consumption.
Summary
The authors present a series of optimizations to enable efficient deployment of Conformer-based end-to-end speech recognition models on resource-constrained edge devices like smartphones, wearables, and home automation devices.
Key highlights:
Depthwise Separable Convolution (DWS): The authors replace the vanilla convolution subsampling in the original Conformer encoder with DWS, cutting the subsampling module's share of the overall computation from 32.8% to 4.0% while maintaining accuracy.
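The source of the saving can be seen from a back-of-the-envelope multiply-accumulate (MAC) count; the shapes below are illustrative, not the paper's exact subsampling configuration:

```python
def conv_macs(c_in, c_out, k, h_out, w_out):
    """MACs for a vanilla k x k convolution layer."""
    return c_in * c_out * k * k * h_out * w_out

def dws_macs(c_in, c_out, k, h_out, w_out):
    """MACs for the depthwise-separable replacement."""
    depthwise = c_in * k * k * h_out * w_out   # one k x k filter per channel
    pointwise = c_in * c_out * h_out * w_out   # 1x1 conv mixes channels
    return depthwise + pointwise

# Example: 256 -> 256 channels, 3x3 kernel, 40x128 output feature map
vanilla = conv_macs(256, 256, 3, 40, 128)
dws = dws_macs(256, 256, 3, 40, 128)
print(f"reduction: {vanilla / dws:.1f}x")   # reduction: 8.7x
```

The reduction factor is roughly 1 / (1/c_out + 1/k²), which is why DWS shrinks the subsampling cost so sharply while leaving the receptive field unchanged.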
Memory-aware Graph Execution: The authors adhere to principles for optimizing transformer models on hardware accelerators, such as using the right data format, chunking large tensors, and minimizing memory copies, to improve inference efficiency.
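The tensor-chunking principle can be illustrated with self-attention, where the score matrix is the largest intermediate. The sketch below (a NumPy toy, not the authors' kernel; block size and shapes are assumptions) processes queries one block at a time so the full (L, L) score matrix is never materialized:

```python
import numpy as np

def chunked_attention(q, k, v, chunk=64):
    """q, k, v: (L, d) arrays. Peak live intermediate is a (chunk, L)
    score block instead of the full (L, L) attention matrix."""
    L, d = q.shape
    out = np.empty_like(q)
    for i in range(0, L, chunk):
        s = q[i:i + chunk] @ k.T / np.sqrt(d)          # (chunk, L) scores
        s = np.exp(s - s.max(axis=1, keepdims=True))   # stable softmax
        out[i:i + chunk] = (s / s.sum(axis=1, keepdims=True)) @ v
    return out
```

Writing each block directly into a preallocated output also avoids the intermediate concatenation copy that a naive per-block list would incur.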
Numerical Stability of Layer Normalization: The authors derive a theory for an optimal low-precision pre-normalizer to numerically stabilize layer normalization computation on hardware accelerators, without requiring model retraining.
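The underlying property can be sketched in NumPy (a toy illustration; the one-pass sum-of-squares formulation and the constants are assumptions, not the paper's derivation): layer normalization is invariant to input scaling, so dividing by a power-of-two pre-normalizer before the float16 reduction keeps the squared terms in range without changing the result or retraining the model:

```python
import numpy as np

def layernorm_fp16(x, prescale=None):
    """One-pass layer norm in float16, the style a fused accelerator
    kernel might use. `prescale` is an illustrative power-of-two
    pre-normalizer applied before the reductions."""
    y = np.asarray(x).astype(np.float16)
    if prescale is not None:
        y = y * np.float16(prescale)   # layer norm output is unchanged
    mean = y.mean(dtype=np.float16)
    var = (y * y).mean(dtype=np.float16) - mean * mean
    return (y - mean) / np.sqrt(var)
```

Without the pre-normalizer, squaring activations of magnitude ~300 already overflows float16 (max ~65504) and the output is garbage; with it, the float16 result closely tracks a float64 reference.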
Softmax Scaling: The authors introduce a conditional re-scaling technique for softmax layers to enable efficient implementation using lookup tables on hardware accelerators with limited support for complex operations.
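A plausible sketch of the idea (the table size, input range, and triggering condition here are assumptions, not the authors' exact scheme): exp is replaced by a precomputed table over a bounded domain, and the inputs are re-scaled by the running max only when some logit falls outside that domain:

```python
import numpy as np

TABLE_BITS = 10                    # assumed 1024-entry table
RANGE = 16.0                       # exp(-16) ~ 1e-7, negligible mass
N = 1 << TABLE_BITS
# Precomputed lookup table for exp(t), t in [-RANGE, 0]
EXP_LUT = np.exp(np.linspace(-RANGE, 0.0, N))

def lut_softmax(logits):
    x = np.asarray(logits, dtype=np.float32)
    # Conditional re-scaling: shift by the max only when needed to
    # bring every logit into the table's domain [-RANGE, 0].
    if x.max() > 0.0 or x.min() < -RANGE:
        x = x - x.max()
    x = np.clip(x, -RANGE, 0.0)    # below -RANGE, exp is ~0 anyway
    idx = np.round((x + RANGE) / RANGE * (N - 1)).astype(np.int64)
    e = EXP_LUT[idx]
    return e / e.sum()
```

Skipping the shift when all logits already lie in the table's domain saves a reduction pass on accelerators where every extra pass over a large tensor is costly.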
The proposed optimizations enable the Conformer-based speech recognition system to run 5.26 times faster than real time (0.19 real-time factor, RTF) on small wearables, while minimizing energy consumption and maintaining state-of-the-art accuracy.
Statistics
In the original Conformer CTC model, the subsampling module accounts for 32.8% of the overall computation.
The Depthwise Separable Convolution (DWS) architecture reduces the subsampling module's share of the overall computation to 4.0%.