The paper introduces LLaMA-Omni, a novel model architecture designed for low-latency and high-quality speech interaction with large language models (LLMs). LLaMA-Omni integrates a pre-trained speech encoder, a speech adaptor, an LLM, and a streaming speech decoder.
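As a rough illustration of how these four components fit together, here is a minimal PyTorch sketch of the data flow. The class name, argument names, shapes, and the dummy stand-ins are assumptions made purely for illustration; they are not the authors' implementation.

```python
import torch
import torch.nn as nn

class LlamaOmniSketch(nn.Module):
    """Wires the four described components together. Every submodule passed in
    is a stand-in here (a frozen speech encoder, the speech adaptor, a
    LLaMA-style LLM, and the streaming unit decoder would be plugged in), so
    names and shapes are illustrative assumptions only."""
    def __init__(self, speech_encoder, speech_adaptor, llm, speech_decoder):
        super().__init__()
        self.speech_encoder = speech_encoder  # waveform -> acoustic features
        self.speech_adaptor = speech_adaptor  # acoustic features -> LLM embedding space
        self.llm = llm                        # speech embeddings -> text tokens + hidden states
        self.speech_decoder = speech_decoder  # hidden states -> discrete speech units

    def forward(self, waveform):
        feats = self.speech_encoder(waveform)       # (B, T_audio, D_enc)
        speech_embeds = self.speech_adaptor(feats)  # (B, T', D_llm)
        text_tokens, hidden = self.llm(speech_embeds)
        units = self.speech_decoder(hidden)         # discrete units for a vocoder
        return text_tokens, units

# Exercise the wiring with dummy stand-ins (all shapes are arbitrary).
B, T, D_enc, D_llm = 1, 50, 1280, 4096
model = LlamaOmniSketch(
    speech_encoder=lambda wav: torch.randn(B, T, D_enc),
    speech_adaptor=nn.Linear(D_enc, D_llm),
    llm=lambda x: (torch.randint(0, 32000, (B, 20)), torch.randn(B, 20, D_llm)),
    speech_decoder=lambda h: torch.randint(0, 1000, (B, 40)),
)
text_tokens, units = model(torch.randn(B, 16000))
```

The key point of this wiring is that the speech response is derived from the LLM's hidden states rather than from its finished text output, which is what allows speech generation to overlap with text generation.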
The key highlights are:
LLaMA-Omni eliminates the need for speech transcription, allowing the model to generate text and speech responses directly from speech instructions with extremely low latency.
The speech adaptor maps the speech representations into the embedding space of the LLM, enabling the LLM to comprehend the input speech.
The streaming speech decoder uses a non-autoregressive Transformer to predict the sequence of discrete units corresponding to the speech response; these units are produced simultaneously with the LLM's autoregressive generation of the text response (a rough sketch of the adaptor and this unit decoder follows the highlights).
To better align with speech interaction scenarios, the authors construct a dataset named InstructS2S-200K, which includes 200K speech instructions and corresponding speech responses.
Experimental results show that LLaMA-Omni provides better responses in both content and style compared to previous speech-language models, with a response latency as low as 226ms.
Training LLaMA-Omni takes less than 3 days on just 4 GPUs, paving the way for the efficient development of speech-language models based on the latest LLMs.
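To make the speech adaptor and the streaming unit decoder from the highlights above more concrete, here is a minimal sketch assuming a frame-stacking adaptor with a two-layer MLP and a CTC-style non-autoregressive unit head over upsampled LLM hidden states. The stack and upsample factors, layer sizes, module names, and unit vocabulary size are illustrative assumptions rather than the paper's exact configuration.

```python
import torch
import torch.nn as nn

class SpeechAdaptor(nn.Module):
    """Downsamples encoder features by stacking adjacent frames, then projects
    them into the LLM embedding space (stack factor and sizes are assumed)."""
    def __init__(self, enc_dim=1280, llm_dim=4096, stack=5):
        super().__init__()
        self.stack = stack
        self.mlp = nn.Sequential(
            nn.Linear(enc_dim * stack, llm_dim), nn.GELU(),
            nn.Linear(llm_dim, llm_dim),
        )

    def forward(self, feats):                      # (B, T, enc_dim)
        B, T, D = feats.shape
        T = T - T % self.stack                     # trim to a multiple of `stack`
        feats = feats[:, :T].reshape(B, T // self.stack, D * self.stack)
        return self.mlp(feats)                     # (B, T // stack, llm_dim)

class NarUnitDecoder(nn.Module):
    """Non-autoregressive unit predictor: upsamples the LLM's output hidden
    states and maps every position to a discrete-unit distribution in parallel
    (upsample ratio, layer count, and unit vocabulary size are assumed)."""
    def __init__(self, llm_dim=4096, model_dim=512, n_units=1000, upsample=2, layers=2):
        super().__init__()
        self.upsample = upsample
        self.proj = nn.Linear(llm_dim, model_dim)
        enc_layer = nn.TransformerEncoderLayer(model_dim, nhead=8, batch_first=True)
        self.transformer = nn.TransformerEncoder(enc_layer, num_layers=layers)
        self.unit_head = nn.Linear(model_dim, n_units + 1)  # +1 for a CTC-style blank

    def forward(self, hidden):                     # (B, T_text, llm_dim)
        x = hidden.repeat_interleave(self.upsample, dim=1)
        x = self.transformer(self.proj(x))
        return self.unit_head(x)                   # (B, T_text * upsample, n_units + 1)

# Quick shape check with random tensors.
adaptor = SpeechAdaptor()
decoder = NarUnitDecoder()
llm_inputs = adaptor(torch.randn(1, 100, 1280))    # -> (1, 20, 4096)
unit_logits = decoder(torch.randn(1, 16, 4096))    # -> (1, 32, 1001)
```

In a setup like this, the unit logits would be collapsed into a unit sequence (e.g. CTC-style) and passed to a unit-based vocoder to synthesize the speech waveform while the text response is still being generated.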
Key insights extracted from the paper by Qingkai Fang et al. (arxiv.org, 09-11-2024): https://arxiv.org/pdf/2409.06666.pdf