The paper proposes LAST (Language Model Aware Speech Tokenization), an approach that trains a speech tokenizer by leveraging objectives from a pre-trained textual language model (LM). The key idea is to integrate the text LM into the process of learning discrete speech representations, with the aim of transforming the features of a pre-trained speech model into a new feature space that clusters better for speech LMs.
The authors first encode the raw speech signal into a continuous representation using a pre-trained speech encoder (e.g., HuBERT). They then introduce a learnable encoder module to further process the speech representations, followed by a vector quantization (VQ) module to discretize the output. To guide the quantization process towards better sequential modeling, the authors feed the resulting discrete speech tokens into a pre-trained textual LM (e.g., OPT) and optimize the system to perform next token prediction.
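As a rough illustration, the PyTorch sketch below wires these pieces together: a frozen speech encoder, a small learnable adapter, a vector-quantization layer, and a frozen text LM driven by a next-token-prediction loss. The module names (`VectorQuantizer`, `LASTTokenizer`, `adapter`, `to_lm`), layer sizes, and the commitment term are illustrative assumptions rather than the authors' exact architecture; `speech_encoder` and `text_lm` are assumed to be callables that map their inputs to `(batch, time, dim)` feature tensors (e.g., HuBERT features and an OPT backbone run on input embeddings).

```python
import torch
import torch.nn as nn
import torch.nn.functional as F


class VectorQuantizer(nn.Module):
    """Nearest-neighbour vector quantization with a straight-through estimator."""

    def __init__(self, vocab_size: int, dim: int):
        super().__init__()
        self.codebook = nn.Embedding(vocab_size, dim)

    def forward(self, x):                                    # x: (B, T, dim)
        flat = x.reshape(-1, x.size(-1))                     # (B*T, dim)
        dists = torch.cdist(flat, self.codebook.weight)      # (B*T, vocab)
        ids = dists.argmin(dim=-1).reshape(x.shape[:-1])     # discrete speech tokens, (B, T)
        quantized = self.codebook(ids)
        commit_loss = F.mse_loss(x, quantized.detach())      # keep adapter outputs near the codebook
        # Straight-through estimator: gradients reach x as if quantization were the identity.
        quantized = x + (quantized - x).detach()
        return quantized, ids, commit_loss


class LASTTokenizer(nn.Module):
    """Learnable adapter + VQ on top of a frozen speech encoder, trained so that
    a frozen text LM can predict the next speech token."""

    def __init__(self, speech_encoder, text_lm, feat_dim=768, lm_dim=768, vocab_size=500):
        super().__init__()
        self.speech_encoder = speech_encoder                 # frozen HuBERT-like feature extractor
        self.text_lm = text_lm                               # frozen OPT-like causal LM backbone
        for module in (self.speech_encoder, self.text_lm):
            for p in module.parameters():
                p.requires_grad_(False)
        layer = nn.TransformerEncoderLayer(feat_dim, nhead=8, batch_first=True)
        self.adapter = nn.TransformerEncoder(layer, num_layers=2)   # learnable encoder module
        self.vq = VectorQuantizer(vocab_size, feat_dim)
        self.to_lm = nn.Linear(feat_dim, lm_dim)             # project quantized features into LM space
        self.lm_head = nn.Linear(lm_dim, vocab_size)         # predicts the next speech token

    def forward(self, wav):
        with torch.no_grad():
            feats = self.speech_encoder(wav)                 # (B, T, feat_dim) continuous representation
        h = self.adapter(feats)
        quantized, ids, commit_loss = self.vq(h)
        # Run the frozen text LM over the projected quantized sequence and train
        # the tokenizer via next-token prediction over its own discrete tokens.
        lm_hidden = self.text_lm(self.to_lm(quantized))      # (B, T, lm_dim)
        logits = self.lm_head(lm_hidden[:, :-1])
        ntp_loss = F.cross_entropy(logits.reshape(-1, logits.size(-1)),
                                   ids[:, 1:].reshape(-1))
        return ids, quantized, feats, ntp_loss + commit_loss
```

In a real setup the frozen LM would be fed through its input-embedding interface rather than a plain callable; the point of the sketch is that gradients reach the adapter and codebook only through the straight-through quantization and the LM's next-token loss.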
Additionally, the authors introduce a reconstruction loss to stabilize optimization and prevent the learned representation from collapsing onto a single token or a trivially repeated token sequence.
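The sketch below adds such a stabilizing term on top of the previous sketch, assuming access to its `quantized` output and the frozen encoder's `feats`; the decoder shape and the weight `lambda_rec` are illustrative assumptions.

```python
import torch.nn as nn
import torch.nn.functional as F


class ReconstructionHead(nn.Module):
    """Reconstructs the frozen encoder's continuous features from the quantized
    representation; penalising the reconstruction error discourages the codebook
    from collapsing onto one (or a few repeated) tokens."""

    def __init__(self, dim: int = 768):
        super().__init__()
        self.decoder = nn.Sequential(nn.Linear(dim, dim), nn.GELU(), nn.Linear(dim, dim))

    def forward(self, quantized, target_feats):
        return F.mse_loss(self.decoder(quantized), target_feats)


# Combined training objective (weights illustrative):
# ids, quantized, feats, lm_loss = tokenizer(wav)
# loss = lm_loss + lambda_rec * recon_head(quantized, feats)
```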
The authors evaluate LAST on zero-resource speech modeling tasks and automatic speech recognition (ASR). The results show that LAST outperforms the traditional k-means tokenization method across all setups, in both spoken language modeling and speech-to-text transcription. Importantly, LAST allows a single pre-trained LM to process both speech and text inputs, setting it apart from conventional tokenization approaches.
The authors also provide an extensive ablation study, which sheds light on the importance of various model design choices, such as speech vocabulary size and text LM size.
Key ideas extracted from a paper by Arnon Turetz... (arxiv.org, 09-11-2024): https://arxiv.org/pdf/2409.03701.pdf