Core Concepts
Exploring a multimodal approach for device-directed speech detection using large language models.
Summary
The study explores whether the trigger phrase can be dropped from virtual assistant commands by detecting device-directed speech directly.
Three sources of information are compared: acoustic information, ASR decoder outputs, and a multimodal system that combines them.
The multimodal system shows significant improvements over text-only and audio-only models.
Training data includes directed and non-directed utterances with trigger phrases.
Evaluation data consists of device-directed and non-directed examples.
Feature extraction covers transcribed text, ASR decoder features, and acoustic features from a pretrained audio encoder.
The method combines an audio encoder, mapping networks, and a decoder-only LLM.
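A minimal sketch of how such a system could be wired together, assuming a prefix-style design in which mapped audio features are prepended to the text token embeddings of a GPT2 backbone; the class and parameter names (MappingNetwork, MultimodalDDSD, audio_dim) are illustrative, not taken from the paper.

```python
# Hypothetical sketch: project audio features into the LLM embedding space,
# prepend them to the text token embeddings, and classify the pooled output.
import torch
import torch.nn as nn
from transformers import GPT2Model

class MappingNetwork(nn.Module):
    """Maps encoder features (audio or ASR decoder signals) to the LLM embedding size."""
    def __init__(self, in_dim: int, llm_dim: int, hidden: int = 512):
        super().__init__()
        self.net = nn.Sequential(
            nn.Linear(in_dim, hidden), nn.GELU(), nn.Linear(hidden, llm_dim)
        )

    def forward(self, x):
        return self.net(x)

class MultimodalDDSD(nn.Module):
    """Device-directed speech detector: audio prefix + text tokens -> decoder-only LLM -> score."""
    def __init__(self, audio_dim: int = 768):
        super().__init__()
        self.llm = GPT2Model.from_pretrained("gpt2")
        llm_dim = self.llm.config.n_embd          # 768 for GPT2-small
        self.audio_map = MappingNetwork(audio_dim, llm_dim)
        self.classifier = nn.Linear(llm_dim, 1)

    def forward(self, audio_feats, input_ids):
        # audio_feats: (B, T_a, audio_dim) from a pretrained audio encoder
        audio_prefix = self.audio_map(audio_feats)            # (B, T_a, llm_dim)
        text_emb = self.llm.wte(input_ids)                    # (B, T_t, llm_dim)
        inputs = torch.cat([audio_prefix, text_emb], dim=1)   # multimodal sequence
        hidden = self.llm(inputs_embeds=inputs).last_hidden_state
        return self.classifier(hidden[:, -1])                 # logit from final position
```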
Large language models like GPT2 are used for text comprehension tasks.
Unimodal baselines include linear classifiers on top of embeddings from Whisper and CLAP models.
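As an illustration of such a baseline, a sketch of a linear probe on a frozen Whisper encoder (a CLAP encoder would be probed analogously); the pooling choice and model size here are assumptions, not details from the paper.

```python
# Hypothetical sketch: linear probe on a frozen Whisper encoder.
import torch
import torch.nn as nn
from transformers import WhisperModel

class WhisperLinearProbe(nn.Module):
    def __init__(self, model_name: str = "openai/whisper-base"):
        super().__init__()
        self.encoder = WhisperModel.from_pretrained(model_name).encoder
        self.encoder.requires_grad_(False)                    # keep the audio encoder frozen
        self.head = nn.Linear(self.encoder.config.d_model, 1)  # linear classifier

    def forward(self, input_features):
        # input_features: log-mel spectrogram batch (e.g. from WhisperFeatureExtractor)
        with torch.no_grad():
            hidden = self.encoder(input_features).last_hidden_state  # (B, T, D)
        pooled = hidden.mean(dim=1)   # mean-pool over time
        return self.head(pooled)      # device-directed score (logit)
```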
Experiments show the effectiveness of combining text, audio, and decoder signals in a multimodal system.
Statistics
Using multimodal information yields relative equal-error-rate (EER) improvements over text-only and audio-only models of up to 39% and 61%, respectively.
Increasing the size of the LLM leads to further relative EER reductions of up to 18% on the dataset.
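For context, equal error rate (EER) is the operating point where the false-accept and false-reject rates coincide; a small sketch of computing it from classifier scores with scikit-learn's ROC utilities (not code from the paper):

```python
# Minimal EER sketch: find the point where false-positive and false-negative rates cross.
import numpy as np
from sklearn.metrics import roc_curve

def equal_error_rate(labels, scores):
    # labels: 1 = device-directed, 0 = non-directed; scores: higher = more directed
    fpr, tpr, _ = roc_curve(labels, scores)
    fnr = 1.0 - tpr
    idx = np.nanargmin(np.abs(fpr - fnr))   # threshold where FPR and FNR are closest
    return (fpr[idx] + fnr[idx]) / 2.0

# A 39% relative improvement means eer_multimodal = 0.61 * eer_text_only.
```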
Quotes
"Large language models have demonstrated state-of-the-art text comprehension abilities."
"Multimodal information consisting of acoustic features obtained from a pretrained audio encoder yields significant improvements."