Zhou, W., Jia, J., Sari, L., Mahadeokar, J., & Kalinli, O. (2024). CJST: CTC Compressor based Joint Speech and Text Training for Decoder-Only ASR. arXiv preprint arXiv:2411.07607.
This paper introduces CJST, a novel framework for improving decoder-only Automatic Speech Recognition (ASR) by leveraging a CTC compressor for joint speech and text training. The study aims to enhance ASR performance, particularly in scenarios where external language models are not used.
The researchers developed CJST, which uses a CTC compressor to align speech and text representations. They explored various compression modes, edge-case handling techniques, and the impact of embedding sharing. The framework was evaluated on the LibriSpeech corpus and an in-house dataset, comparing its performance against traditional adaptor-based methods. The effectiveness of joint speech and text training was then assessed in both in-domain and cross-domain scenarios using the LibriSpeech and TED-LIUM2 datasets.
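To make the core idea concrete, the sketch below illustrates the general CTC-compression technique: per-frame CTC posteriors are used to drop frames predicted as blank and to average runs of consecutive frames sharing the same predicted label. This is a minimal numpy sketch of one common compression mode, not the authors' exact implementation; the function name and interface are illustrative.

```python
import numpy as np

def ctc_compress(frames, posteriors, blank_id=0):
    """Compress encoder frames using CTC posteriors (illustrative sketch).

    frames:     (T, D) array of encoder outputs
    posteriors: (T, V) array of per-frame CTC label scores
    Frames whose argmax label is blank are dropped; consecutive frames
    with the same non-blank argmax label are averaged into one vector.
    """
    labels = posteriors.argmax(axis=-1)
    compressed = []
    start = 0
    for t in range(1, len(labels) + 1):
        # Close the current run at the sequence end or on a label change.
        if t == len(labels) or labels[t] != labels[start]:
            if labels[start] != blank_id:
                compressed.append(frames[start:t].mean(axis=0))
            start = t
    if not compressed:
        return np.empty((0, frames.shape[1]))
    return np.stack(compressed)

# Example: 6 frames with predicted labels [1, 1, blank, 2, blank, blank]
frames = np.arange(12, dtype=float).reshape(6, 2)
posteriors = np.array([[0, 1, 0], [0, 1, 0], [1, 0, 0],
                       [0, 0, 1], [1, 0, 0], [1, 0, 0]], dtype=float)
out = ctc_compress(frames, posteriors)  # two compressed frames remain
```

The compressed sequence is much shorter than the raw encoder output, which is what makes feeding speech into a decoder-only model tractable and gives text and speech representations comparable granularity.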
The authors conclude that CJST offers a robust and effective approach for joint speech and text training in decoder-only ASR systems. The framework's ability to leverage the CTC compressor for modality alignment and its strong performance in various scenarios highlight its potential for advancing ASR technology.
This research contributes to the field of ASR by enabling joint speech and text training in decoder-only models. The findings have practical implications for building more accurate and efficient ASR systems, particularly in scenarios where external language models are limited or unavailable.
The study primarily focused on offline ASR tasks. Further research could explore the applicability and effectiveness of CJST in online or streaming ASR scenarios. Additionally, investigating the impact of different modality adaptor architectures and training strategies within the CJST framework could lead to further performance improvements.