The XLSR-Transducer leverages a pretrained XLSR-53 encoder together with novel attention masking techniques to achieve high-performance streaming ASR, particularly in low-resource and multilingual settings.
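The attention masking idea can be illustrated with a minimal chunk-wise mask: each frame attends only to its own chunk and a limited number of past chunks, never to future chunks, which is what makes the encoder streamable. This is a generic sketch of chunk-based causal masking, not the paper's exact mask schedule; the function name and parameters are illustrative.

```python
def chunk_attention_mask(num_frames, chunk_size, left_chunks=1):
    """Build a boolean attention mask for chunk-wise streaming.

    Entry mask[i][j] is True if frame i may attend to frame j.
    Each frame sees its own chunk plus `left_chunks` preceding
    chunks, and never any future chunk (streaming-safe).
    """
    mask = [[False] * num_frames for _ in range(num_frames)]
    for i in range(num_frames):
        ci = i // chunk_size                      # chunk index of frame i
        start = max(0, (ci - left_chunks) * chunk_size)
        end = min(num_frames, (ci + 1) * chunk_size)
        for j in range(start, end):
            mask[i][j] = True
    return mask

# 6 frames, chunks of 2, one chunk of left history:
m = chunk_attention_mask(6, 2, left_chunks=1)
```

Training with such masks lets a full-context pretrained encoder like XLSR-53 be adapted to streaming inference without changing its architecture.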
The proposed streaming ASR model combines a Mamba encoder, a lookahead mechanism, and a unimodal aggregation framework to achieve state-of-the-art performance.
CUSIDE-T incorporates future-context simulation and language model integration into the RNN-T architecture, achieving higher streaming ASR accuracy than the existing U2++ approach.
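The core of future-context simulation is to predict a few right-context frames from the current chunk, encode the chunk together with the simulated frames, and then discard the simulated positions' outputs. A minimal sketch of that flow, where `encoder` and `predictor` are hypothetical callables standing in for the real networks:

```python
def encode_with_simulated_future(chunk, encoder, predictor, r):
    """Encode a chunk with r simulated right-context frames.

    `predictor` guesses future frames from the chunk itself (in
    CUSIDE-style training it is fit to match the real future
    context); `encoder` is any frame-level encoder. Only outputs
    for the real frames are returned.
    """
    simulated = predictor(chunk)[:r]        # guess r future frames
    encoded = encoder(chunk + simulated)    # encode chunk + simulated context
    return encoded[:len(chunk)]             # drop simulated positions
```

At inference time this gives the encoder right context without waiting for real future audio, keeping latency low.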
An efficient and accurate streaming speech recognition model based on the FastConformer architecture uses a cache-based inference mechanism and a hybrid CTC/RNN-T objective to boost accuracy and speed up convergence.
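The cache-based inference idea can be sketched as processing audio chunk by chunk while carrying a fixed-size cache of previous frames as left context, so each chunk is encoded with history but without reprocessing the whole utterance. This is a generic illustration of cached-context streaming, not NeMo's actual FastConformer cache API; the function and parameter names are assumptions.

```python
def stream_chunks(frames, chunk_size, cache_len, encode):
    """Encode `frames` chunk by chunk with a rolling left-context cache.

    `encode` is a hypothetical frame-level encoder; only the outputs
    for the new frames of each chunk are kept, so total work per
    chunk stays bounded by chunk_size + cache_len.
    """
    cache = []
    outputs = []
    for start in range(0, len(frames), chunk_size):
        chunk = frames[start:start + chunk_size]
        context = cache + chunk                   # prepend cached history
        encoded = encode(context)
        outputs.extend(encoded[len(cache):])      # keep new-frame outputs only
        cache = (cache + chunk)[-cache_len:]      # roll the cache forward
    return outputs
```

In a real implementation the cache would hold intermediate activations (e.g. convolution and attention states) rather than raw frames, which is what makes chunked inference exactly match full-utterance inference.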