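This paper presents CJST, a new joint speech and text training framework for decoder-only automatic speech recognition. Built on a CTC compressor, the framework injects text into the model effectively without having to handle duration, and achieves the best performance in both in-domain and cross-domain scenarios.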
CJST, a novel framework for decoder-only automatic speech recognition (ASR), leverages a CTC compressor to effectively integrate speech and text data during training, leading to improved performance in both in-domain and cross-domain scenarios.
This paper introduces MADEON, a novel decoder-only architecture for automatic speech recognition (ASR) that utilizes Mamba, a selective state space model (SSM), to efficiently process speech tokens and generate text transcriptions.
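This study proposes two training methods that use limited punctuated data to build an end-to-end speech recognition system able to output both punctuated and normalized text.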
This research introduces two novel approaches for training an end-to-end Automatic Speech Recognition (ASR) system that generates both punctuated and normalized transcripts, even with limited punctuated training data.
Regularizing the decoder module of encoder-decoder ASR models with auxiliary classifiers improves robustness and generalization to out-of-domain scenarios, and enables rapid domain adaptation.
Moonshine is a new family of speech recognition models designed for on-device applications, achieving comparable accuracy to OpenAI's Whisper while significantly reducing latency and computational requirements by optimizing for variable-length audio inputs.
This review paper analyzes recent advancements in Automatic Speech Recognition (ASR) by exploring the use of BERT and Connectionist Temporal Classification (CTC) transformers, highlighting their architectures, applications, performance, limitations, and future research directions.
Rev open-sources its state-of-the-art ASR and diarization models, trained on a massive dataset of human-transcribed audio, offering high accuracy and verbatimicity control for non-commercial use.
To address the efficiency problems of existing SpeechLLMs in long-form speech recognition, this work introduces SpeechLLM-XL, a new streaming model that achieves linear scaling and uses a limited attention window.