Core Concepts
The authors propose the Skip-and-Recover strategy, implemented in the Skipformer architecture, to dynamically reduce input sequence length while improving recognition accuracy in speech recognition tasks.
Abstract
Skipformer addresses the computational overhead caused by long input sequences in Conformer-based attention models for Automatic Speech Recognition. By splitting frames into crucial, skipping, and ignoring groups based on an intermediate CTC output, the model substantially shortens the input sequence while maintaining recognition accuracy. Experimental results demonstrate improved efficiency and faster inference compared to baseline models such as Conformer and Efficient Conformer. The proposed approach shows promise for enhancing computational efficiency without compromising performance.
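The frame-grouping step can be sketched as follows. This is a minimal illustration, not the paper's implementation: it assumes the grouping rule is that frames whose intermediate-CTC argmax is non-blank are "crucial", blank frames adjacent to a non-blank frame are "skipping", and the remaining blank frames are "ignoring"; the function name and rule details are hypothetical.

```python
def group_frames(ctc_pred, blank_id=0):
    """Split frame indices into crucial / skipping / ignoring groups.

    ctc_pred: per-frame argmax token ids from an intermediate CTC head.
    Hypothetical grouping rule for illustration: non-blank frames are
    'crucial'; blank frames adjacent to a non-blank frame are 'skipping';
    the remaining blank frames are 'ignoring' and can be dropped,
    shortening the sequence fed to the later encoder blocks.
    """
    T = len(ctc_pred)
    crucial, skipping, ignoring = [], [], []
    for t, tok in enumerate(ctc_pred):
        if tok != blank_id:
            crucial.append(t)
        elif (t > 0 and ctc_pred[t - 1] != blank_id) or \
             (t + 1 < T and ctc_pred[t + 1] != blank_id):
            skipping.append(t)
        else:
            ignoring.append(t)
    return crucial, skipping, ignoring
```

For example, `group_frames([0, 0, 2, 0, 0, 0])` keeps frame 2 as crucial, frames 1 and 3 as skipping, and drops frames 0, 4, and 5, so only half the frames would reach the expensive encoder layers.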
Stats
Our model reduces the input sequence length by a factor of 31 on Aishell-1 and 22 on the Librispeech corpus.
The proposed Skipformer achieves a 4.23% CER on the Aishell-1 test set, an 8% relative CER reduction.
Mode 2 of Skipformer was verified to be the most computationally efficient configuration while achieving the best recognition performance.
Quotes
"Our proposed Skipformer consists of two main operations: frame skipping and recovering."
"The less useful information one frame contains, the simpler model required to model it."
"The more crucial information one frame contains, the more complex model required to model it."