The authors present a series of optimizations for efficiently deploying Conformer-based end-to-end speech recognition models on resource-constrained edge hardware such as smartphones, wearables, and home automation devices.
Key highlights:
Depthwise Separable Convolution (DWS): The authors replace the vanilla convolution subsampling in the original Conformer encoder with DWS, cutting the subsampling stage's share of the model's computation from 32.8% to 4.0% while maintaining accuracy.
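As a reference point, here is a minimal PyTorch sketch of a depthwise-separable subsampling block; the channel counts and kernel sizes are illustrative assumptions, not the paper's exact configuration:

```python
import torch
import torch.nn as nn

class DWSConvSubsampling(nn.Module):
    """4x time subsampling with a depthwise-separable second stage.

    Illustrative sketch: splitting a dense conv into a depthwise conv
    (groups = channels) plus a 1x1 pointwise conv cuts multiply-accumulates
    roughly by the kernel area, which is how DWS shrinks the subsampling cost.
    """

    def __init__(self, in_ch: int = 1, out_ch: int = 256):
        super().__init__()
        self.layers = nn.Sequential(
            # First stride-2 stage: plain conv lifts 1 input channel to out_ch
            nn.Conv2d(in_ch, out_ch, kernel_size=3, stride=2, padding=1),
            nn.ReLU(),
            # Second stride-2 stage: depthwise conv filters each channel alone...
            nn.Conv2d(out_ch, out_ch, kernel_size=3, stride=2, padding=1, groups=out_ch),
            # ...then a pointwise 1x1 conv mixes information across channels
            nn.Conv2d(out_ch, out_ch, kernel_size=1),
            nn.ReLU(),
        )

    def forward(self, x: torch.Tensor) -> torch.Tensor:
        # x: (batch, 1, time, freq) log-mel features
        return self.layers(x)
```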
Memory-aware Graph Execution: The authors adhere to principles for optimizing transformer models on hardware accelerators, such as using the right data format, chunking large tensors, and minimizing memory copies, to improve inference efficiency.
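A toy illustration of the tensor-chunking and copy-minimization principles, assuming a chunked Q·Kᵀ attention-score computation (the chunk size and shapes are hypothetical, not the paper's scheme):

```python
import torch

def chunked_scores(q: torch.Tensor, k: torch.Tensor, chunk: int = 128) -> torch.Tensor:
    """Compute Q @ K^T one time-chunk at a time.

    Each slice of q is a view (no copy), k's transpose is a view, and every
    chunked matmul writes directly into a preallocated output via out=, so
    no temporary beyond the output itself is materialized.
    """
    T = q.shape[0]
    out = torch.empty(T, k.shape[0], dtype=q.dtype, device=q.device)
    kt = k.transpose(0, 1)  # view, not a copy
    for start in range(0, T, chunk):
        stop = min(start + chunk, T)
        # out[start:stop] is also a view; matmul fills it in place
        torch.matmul(q[start:stop], kt, out=out[start:stop])
    return out
```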
Numerical Stability of Layer Normalization: The authors theoretically derive an optimal low-precision pre-normalizer that numerically stabilizes the layer normalization computation on hardware accelerators, without requiring model retraining.
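Because mean-and-variance normalization is invariant to scaling its input by a constant, dividing by a pre-normalizer before layer norm leaves the result unchanged in exact arithmetic while keeping intermediate sums inside low-precision range. A minimal sketch, using a power-of-two pre-normalizer as an illustrative stand-in for the paper's derived optimum:

```python
import torch

def prenormalized_layer_norm(x: torch.Tensor, eps: float = 1e-5) -> torch.Tensor:
    """Layer norm preceded by a power-of-two pre-normalizer.

    Dividing x by a per-row constant alpha changes nothing mathematically
    (mean and std scale by the same factor), but it keeps the sum of squares
    from overflowing in fp16. Choosing alpha as the next power of two above
    max|x| makes the division exact in floating point; this choice is an
    assumption, not the paper's exact derivation.
    """
    alpha = 2.0 ** torch.ceil(
        torch.log2(x.abs().amax(dim=-1, keepdim=True).clamp(min=1.0))
    )
    y = x / alpha  # exact: alpha is a power of two
    mean = y.mean(dim=-1, keepdim=True)
    var = y.var(dim=-1, keepdim=True, unbiased=False)
    return (y - mean) / torch.sqrt(var + eps)
```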
Softmax Scaling: The authors introduce a conditional re-scaling technique for softmax layers to enable efficient implementation using lookup tables on hardware accelerators with limited support for complex operations.
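Softmax is invariant to subtracting a constant per row, which is what makes a fixed-range exponential lookup table workable. The sketch below is one plausible reading of conditional re-scaling; the LUT range, size, and trigger condition are all assumptions:

```python
import numpy as np

# Hypothetical LUT for exp(t) on t in [-LUT_RANGE, 0], standing in for a
# table baked into an accelerator without a native exponential instruction.
LUT_RANGE, LUT_SIZE = 16.0, 1024
_exp_lut = np.exp(np.linspace(-LUT_RANGE, 0.0, LUT_SIZE))

def lut_exp(t: np.ndarray) -> np.ndarray:
    # Nearest-entry lookup; t must already lie in [-LUT_RANGE, 0]
    idx = np.clip(
        ((t + LUT_RANGE) / LUT_RANGE * (LUT_SIZE - 1)).round().astype(int),
        0, LUT_SIZE - 1,
    )
    return _exp_lut[idx]

def softmax_with_conditional_rescale(logits: np.ndarray) -> np.ndarray:
    """Softmax via a fixed-range exp LUT with conditional re-scaling."""
    m = logits.max(axis=-1, keepdims=True)
    # Re-scale only when some logit would exceed the table's upper bound;
    # subtracting the per-row max leaves the softmax output unchanged.
    if (m > 0.0).any():
        logits = logits - m
    # Entries far below the max contribute ~0; clamping keeps them in range.
    logits = np.maximum(logits, -LUT_RANGE)
    e = lut_exp(logits)
    return e / e.sum(axis=-1, keepdims=True)
```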
Together, the proposed optimizations let the Conformer-based speech recognition system run 5.26 times faster than real time (0.19 RTF) on small wearables, while reducing energy consumption and maintaining state-of-the-art accuracy.
Key insights distilled from: Mingbin Xu, A... at arxiv.org, 04-01-2024
https://arxiv.org/pdf/2312.10359.pdf