Jointly training a Hybrid Autoregressive Transducer (HAT) with various Connectionist Temporal Classification (CTC) objectives, including the proposed Internal Acoustic Model (IAM), improves HAT-based automatic speech recognition performance. Dual blank thresholding, which combines HAT-blank and IAM-blank thresholding and is paired with a compatible decoding algorithm, increases decoding speed by 42-75% without significant degradation in accuracy.
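The idea of dual blank thresholding can be illustrated with a minimal sketch. It is not the paper's decoding algorithm: the function name `dual_blank_skip`, the threshold values, and the rule that a frame is skipped when either blank probability exceeds its threshold are all assumptions made for illustration. It also assumes one HAT blank probability per frame, whereas in a real transducer the blank probability also depends on the prediction-network state.

```python
def dual_blank_skip(hat_blank_prob, iam_blank_prob,
                    hat_threshold=0.95, iam_threshold=0.95):
    """Return the indices of frames that still need full token expansion.

    Assumption (not from the source): a frame is skipped as soon as either
    the IAM (CTC-style) blank probability or the HAT blank probability
    reaches its threshold.
    """
    if len(hat_blank_prob) != len(iam_blank_prob):
        raise ValueError("both inputs must have one value per frame")

    kept = []
    for t, (p_hat, p_iam) in enumerate(zip(hat_blank_prob, iam_blank_prob)):
        # Skip frames that either model considers almost certainly blank.
        if p_iam >= iam_threshold or p_hat >= hat_threshold:
            continue
        kept.append(t)
    return kept


# Hypothetical per-frame blank probabilities for four frames.
hat = [0.99, 0.10, 0.50, 0.97]
iam = [0.20, 0.30, 0.99, 0.98]
kept_frames = dual_blank_skip(hat, iam)
```

In this toy input only frame 1 survives both checks, so the expensive non-blank computation would run on one frame out of four. The speed-up reported above comes from the same kind of pruning, although the actual figures depend on the model and thresholds.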
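Sortformer is a new neural network model that integrates speaker diarization and speech recognition, solving the speaker diarization problem by linking timing information with tokens.
On large-scale datasets, Mixture-of-Experts (MoE)-based models can achieve accuracy comparable to dense models while offering more efficient inference.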
This study provides a detailed comparative analysis of the performance of various Automatic Speech Recognition (ASR) models, including Whisper, on the Fearless Steps APOLLO corpus of historical NASA Apollo mission communications. The key focus is on identifying and understanding subgroup-specific performance variations, with the goal of informing advancements in ASR systems for Earth-to-space communications.
This work introduces Echo Multi-Scale Attention (Echo-MSA), a module that represents variable-length speech features more accurately in automatic speech recognition by using dynamic attention mechanisms that adapt to different speech complexities and durations.
Transducers with Pronunciation-aware Embeddings (PET) can improve speech recognition accuracy by incorporating shared components in the decoder embeddings for text tokens with the same or similar pronunciations.
The authors propose a novel internal language model (ILM) training and decoding strategy for factorized transducer models, which effectively combines the blank, acoustic, and ILM scores to achieve substantial performance improvements in automatic speech recognition.
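One way to picture combining blank, acoustic, and ILM scores is a log-linear sum over separately normalized heads. The sketch below is an assumption-laden illustration, not the authors' method: the function `combine_scores`, the sigmoid (Bernoulli) treatment of the blank logit, and the `ilm_weight` parameter are hypothetical choices.

```python
import math


def log_softmax(logits):
    """Numerically stable log-softmax over a list of floats."""
    m = max(logits)
    lse = m + math.log(sum(math.exp(x - m) for x in logits))
    return [x - lse for x in logits]


def combine_scores(blank_logit, acoustic_logits, ilm_logits, ilm_weight=1.0):
    """Return log-scores [blank, token_0, token_1, ...].

    Assumptions (not from the source): blank is a Bernoulli decision made
    with a sigmoid, and each non-blank token score is the sum of the
    non-blank log-probability, the acoustic log-probability, and the
    weighted ILM log-probability.
    """
    if len(acoustic_logits) != len(ilm_logits):
        raise ValueError("acoustic and ILM heads must share a vocabulary")

    log_p_blank = -math.log1p(math.exp(-blank_logit))    # log sigmoid(x)
    log_p_nonblank = -math.log1p(math.exp(blank_logit))  # log(1 - sigmoid(x))

    acoustic = log_softmax(acoustic_logits)
    ilm = log_softmax(ilm_logits)
    tokens = [log_p_nonblank + a + ilm_weight * l
              for a, l in zip(acoustic, ilm)]
    return [log_p_blank] + tokens


# Hypothetical logits for a two-token vocabulary.
scores = combine_scores(0.0, [1.0, 1.0], [2.0, 0.0], ilm_weight=0.5)
```

The token scores produced this way are not renormalized, which is usually acceptable for beam-search ranking. The `ilm_weight` knob shows where the relative influence of the internal language model could be tuned.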
BRAVEn, an extension to the RAVEn method, learns strong visual and auditory speech representations entirely from raw audio-visual data, achieving state-of-the-art performance among self-supervised methods in various settings.
Through architectural and numerical optimizations, the authors demonstrate that Conformer-based end-to-end speech recognition models can be efficiently deployed on resource-constrained devices such as mobile phones and wearables, while preserving recognition accuracy, achieving faster-than-real-time performance, and reducing energy consumption.
A hierarchical recurrent adapter module is introduced that achieves better parameter efficiency in large-scale multi-task adaptation scenarios compared to previous adapter-based approaches and full model fine-tuning.