Yang, C., Fu, Y., Li, C., Lin, Y., Lin, Y., Chen, W., ... & Lee, H. (2024). Building a Taiwanese Mandarin Spoken Language Model: A First Attempt. arXiv preprint arXiv:2411.07111v1.
This technical report presents the development of a spoken large language model (LLM) designed specifically for Taiwanese Mandarin, aimed at enabling real-time, speech-to-speech interaction in multi-turn conversations.
The researchers employed a decoder-only transformer architecture for their spoken LLM, initialized from a pre-trained text-based LLM (LLaMA-3.1 8B). They trained on a combination of real-world and synthetic data, with the synthetic data generated using text-based LLMs and a Taiwanese Mandarin TTS system. Training proceeded in two stages: pre-training for text-speech alignment, followed by supervised fine-tuning for multi-turn dialogue proficiency. For real-time interaction, the system incorporated streaming ASR (Whisper), a speech unit encoder (HuBERT), and a diffusion-based speech decoder.
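The cascaded design described above (streaming ASR feeding text, a unit encoder feeding discrete speech tokens, a decoder-only LM producing a reply, and a diffusion decoder synthesizing audio) can be sketched with stub components. This is a minimal illustrative sketch, not the paper's implementation: every class, method, and codebook size below is a hypothetical placeholder standing in for the real Whisper, HuBERT, LLaMA-3.1, and diffusion-decoder models.

```python
# Hypothetical stub pipeline illustrating the cascaded architecture:
# streaming ASR -> speech-unit encoder -> decoder-only spoken LM -> diffusion decoder.
# All names are illustrative assumptions, not APIs from the paper's system.

class StreamingASR:
    """Stands in for streaming Whisper: audio chunks -> partial transcript."""
    def transcribe(self, audio_chunks):
        # Pretend each incoming chunk decodes to one word.
        return " ".join(f"word{i}" for i, _ in enumerate(audio_chunks))

class SpeechUnitEncoder:
    """Stands in for HuBERT: audio chunks -> discrete unit IDs."""
    def encode(self, audio_chunks):
        # Assume a 500-entry unit codebook; hash is a stand-in for clustering.
        return [hash(chunk) % 500 for chunk in audio_chunks]

class SpokenLM:
    """Stands in for the decoder-only transformer (text-LLM initialized)."""
    def generate(self, text, units):
        # The real model autoregressively emits interleaved text and unit
        # tokens; here we return a fixed reply plus dummy output units.
        reply_text = f"reply-to:{len(units)}-units"
        reply_units = [(u + 1) % 500 for u in units]
        return reply_text, reply_units

class DiffusionDecoder:
    """Stands in for the diffusion-based unit-to-speech decoder."""
    def synthesize(self, units):
        return [f"frame{u}" for u in units]  # placeholder waveform frames

def respond(audio_chunks):
    """One conversational turn through the cascaded pipeline."""
    asr, enc, lm, dec = StreamingASR(), SpeechUnitEncoder(), SpokenLM(), DiffusionDecoder()
    text = asr.transcribe(audio_chunks)          # streaming transcription
    units = enc.encode(audio_chunks)             # discrete speech units
    reply_text, reply_units = lm.generate(text, units)
    return reply_text, dec.synthesize(reply_units)
```

The sketch highlights why the report flags latency as a concern: each stage must hand its output to the next, so any non-streaming component stalls the whole turn.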
The researchers demonstrated the feasibility of real-time, full-duplex speech interaction using a standard decoder-only transformer architecture. They highlighted the role of synthetic data in preventing catastrophic forgetting of the capabilities of the text LLM used for initialization, and emphasized the challenges of achieving seamless turn-taking and low latency in real-time speech interaction.
This work presents a significant step towards developing open-source spoken LLMs for Taiwanese Mandarin, capable of engaging in natural and fluent conversations. The researchers' approach of leveraging existing text-based LLM architectures and training methodologies offers a promising direction for future research in this area.
This research contributes to the growing field of spoken language modeling, particularly for low-resource languages like Taiwanese Mandarin. The development of such models has significant implications for various applications, including conversational AI, speech-to-speech translation, and accessibility tools.
The report acknowledges the limitations of the current system, particularly the high latency introduced by the streaming components. Future research directions include exploring fully streaming ASR and speech decoding models, improving turn-taking detection accuracy, and reducing overall system latency for a more natural and responsive conversational experience. Additionally, further evaluation of the model's performance on standardized benchmarks and in real-world scenarios is crucial.