The paper proposes a novel framework for open-vocabulary keyword spotting that leverages knowledge from a pre-trained text-to-speech (TTS) model. The key idea is to use the intermediate representations of the TTS model as text representations that already encode acoustic information, which improves the alignment between audio and text embeddings.
The proposed architecture consists of four main components: a text encoder, an audio encoder, a pattern extractor, and a pattern discriminator. The text encoder incorporates the pre-trained Tacotron 2 TTS model to generate text representations that are informed by acoustic projections. The audio encoder processes the input audio features with convolutional and recurrent layers. The pattern extractor uses a cross-attention mechanism to capture the temporal correlations between the audio and text embeddings. Finally, the pattern discriminator decides whether the audio and text inputs correspond to the same keyword, as sketched below.
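To make the component roles concrete, here is a minimal PyTorch sketch of the pipeline. It is an illustration under stated assumptions, not the paper's implementation: the layer sizes, GRU choice, and head count are placeholders, and the text embeddings that would come from a frozen Tacotron 2 encoder are faked with random tensors.

```python
import torch
import torch.nn as nn

class AudioEncoder(nn.Module):
    """Processes audio features with convolutional and recurrent layers."""
    def __init__(self, n_mels=80, hidden=128):
        super().__init__()
        self.conv = nn.Sequential(
            nn.Conv1d(n_mels, hidden, kernel_size=3, padding=1),
            nn.ReLU(),
        )
        self.rnn = nn.GRU(hidden, hidden, batch_first=True, bidirectional=True)

    def forward(self, x):                    # x: (batch, frames, n_mels)
        h = self.conv(x.transpose(1, 2)).transpose(1, 2)
        out, _ = self.rnn(h)                 # (batch, frames, 2*hidden)
        return out

class PatternExtractor(nn.Module):
    """Cross-attention: audio embeddings attend over text embeddings."""
    def __init__(self, dim=256, heads=4):
        super().__init__()
        self.attn = nn.MultiheadAttention(dim, heads, batch_first=True)

    def forward(self, audio_emb, text_emb):
        out, _ = self.attn(query=audio_emb, key=text_emb, value=text_emb)
        return out

class PatternDiscriminator(nn.Module):
    """Binary decision: do the audio and text share the same keyword?"""
    def __init__(self, dim=256, hidden=128):
        super().__init__()
        self.rnn = nn.GRU(dim, hidden, batch_first=True)
        self.fc = nn.Linear(hidden, 1)

    def forward(self, patterns):
        _, h = self.rnn(patterns)            # h: (1, batch, hidden)
        return self.fc(h[-1])                # one match logit per pair

# text_emb would come from a frozen Tacotron 2 encoder in the paper;
# here both modalities are random tensors purely for shape-checking.
audio = torch.randn(2, 100, 80)              # (batch, frames, mels)
text_emb = torch.randn(2, 12, 256)           # (batch, chars, dim)
audio_emb = AudioEncoder(hidden=128)(audio)  # -> (2, 100, 256)
patterns = PatternExtractor(dim=256)(audio_emb, text_emb)
logit = PatternDiscriminator(dim=256)(patterns)
print(logit.shape)                           # torch.Size([2, 1])
```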
The performance of the proposed approach is evaluated on four datasets: Google Commands V1, Qualcomm Keyword Speech, LibriPhrase-Easy, and LibriPhrase-Hard. The results show that the proposed method outperforms various baselines, particularly on the challenging LibriPhrase-Hard dataset, where it achieves a notably higher area under the ROC curve (AUC) and a lower equal error rate (EER) than the cross-modality correspondence detector (CMCD) method.
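For reference, both AUC and EER are computed from the pairwise match scores that the discriminator produces. The sketch below shows one common way to derive them with scikit-learn; the scores and labels here are synthetic stand-ins, not the paper's data.

```python
import numpy as np
from sklearn.metrics import roc_curve, roc_auc_score

# Synthetic pairwise scores: higher means "same keyword".
rng = np.random.default_rng(0)
labels = rng.integers(0, 2, size=1000)             # 1 = matching pair
scores = rng.normal(loc=labels.astype(float), scale=1.0)

auc = roc_auc_score(labels, scores)

# EER: the operating point where the false-positive rate
# equals the false-negative rate.
fpr, tpr, _ = roc_curve(labels, scores)
fnr = 1 - tpr
eer = fpr[np.nanargmin(np.abs(fpr - fnr))]

print(f"AUC: {auc:.3f}, EER: {eer:.3f}")
```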
Additionally, the paper conducts an ablation study to investigate the efficacy of different intermediate representations from the Tacotron 2 model. The results indicate that the Bi-LSTM block output (E3) yields the best performance and the fastest convergence during training. The proposed approach also demonstrates its robustness in the out-of-vocabulary (OOV) scenario, again outperforming the CMCD baseline.
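Extracting such intermediate representations from a pre-trained encoder is typically done by tapping its internal modules. The sketch below illustrates this with PyTorch forward hooks on a stand-in text encoder; the module layout and names (a conv stack followed by a Bi-LSTM) mirror Tacotron 2's encoder only loosely and are hypothetical, as is the mapping of the E2/E3 labels onto them.

```python
import torch
import torch.nn as nn

# Stand-in for a Tacotron 2-style text encoder: conv blocks + Bi-LSTM.
# Attribute names are hypothetical, not Tacotron 2's actual ones.
class TextEncoder(nn.Module):
    def __init__(self, vocab=64, dim=256):
        super().__init__()
        self.embed = nn.Embedding(vocab, dim)
        self.convs = nn.Sequential(
            nn.Conv1d(dim, dim, 5, padding=2), nn.ReLU(),
            nn.Conv1d(dim, dim, 5, padding=2), nn.ReLU(),
        )
        self.bilstm = nn.LSTM(dim, dim // 2, batch_first=True,
                              bidirectional=True)

    def forward(self, tokens):
        x = self.embed(tokens)                           # (batch, len, dim)
        x = self.convs(x.transpose(1, 2)).transpose(1, 2)
        out, _ = self.bilstm(x)                          # "E3"-style output
        return out

encoder = TextEncoder().eval()
captured = {}

def save_output(name):
    def hook(module, inputs, output):
        # LSTM modules return (output, (h, c)); keep the sequence output.
        captured[name] = output[0] if isinstance(output, tuple) else output
    return hook

encoder.convs.register_forward_hook(save_output("conv_block"))  # "E2"-style
encoder.bilstm.register_forward_hook(save_output("bilstm"))     # "E3"-style

with torch.no_grad():
    encoder(torch.randint(0, 64, (2, 12)))

for name, t in captured.items():
    print(name, tuple(t.shape))
```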