Core Concepts
A novel framework that leverages intermediate representations extracted from a pre-trained text-to-speech (TTS) model to enhance the performance of open vocabulary keyword spotting.
Abstract
The paper proposes a novel framework for open vocabulary keyword spotting that leverages knowledge from a pre-trained text-to-speech (TTS) model. The key idea is to use intermediate representations from the TTS model as text representations that carry acoustic information, improving the alignment between audio and text embeddings.
The proposed architecture consists of four main components: a text encoder, an audio encoder, a pattern extractor, and a pattern discriminator. The text encoder incorporates the pre-trained Tacotron 2 TTS model to generate text representations that are aware of audio projections. The audio encoder processes the input audio features using convolutional and recurrent layers. The pattern extractor employs a cross-attention mechanism to capture the temporal correlations between the audio and text embeddings. Finally, the pattern discriminator determines whether the audio and text inputs share the same keyword or not.
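To make this layout concrete, here is a minimal PyTorch sketch of the audio encoder, pattern extractor, and pattern discriminator. Layer sizes, module choices, and names are illustrative assumptions rather than the paper's exact configuration; the text embeddings are assumed to come from the Tacotron 2 encoder discussed below.

```python
# Illustrative sketch of the four-component layout (not the paper's exact model).
import torch
import torch.nn as nn

class AudioEncoder(nn.Module):
    def __init__(self, n_mels=80, hidden=128):
        super().__init__()
        # Convolutional front-end over mel-spectrogram frames.
        self.conv = nn.Sequential(
            nn.Conv1d(n_mels, hidden, kernel_size=3, padding=1), nn.ReLU(),
            nn.Conv1d(hidden, hidden, kernel_size=3, padding=1), nn.ReLU(),
        )
        # Recurrent layer for temporal context.
        self.gru = nn.GRU(hidden, hidden, batch_first=True, bidirectional=True)

    def forward(self, mels):                        # mels: (B, T, n_mels)
        x = self.conv(mels.transpose(1, 2)).transpose(1, 2)
        out, _ = self.gru(x)                        # (B, T, 2 * hidden)
        return out

class PatternExtractor(nn.Module):
    """Cross-attention: audio frames attend to TTS-derived text embeddings."""
    def __init__(self, dim=256, heads=4):
        super().__init__()
        self.attn = nn.MultiheadAttention(dim, heads, batch_first=True)

    def forward(self, audio_emb, text_emb):
        fused, _ = self.attn(query=audio_emb, key=text_emb, value=text_emb)
        return fused                                # (B, T_audio, dim)

class PatternDiscriminator(nn.Module):
    """Binary decision: do the audio and text inputs share the same keyword?"""
    def __init__(self, dim=256):
        super().__init__()
        self.head = nn.Sequential(nn.Linear(dim, dim), nn.ReLU(), nn.Linear(dim, 1))

    def forward(self, fused):
        pooled = fused.mean(dim=1)                  # average over audio frames
        return self.head(pooled)                    # match / no-match logit
```

In such a setup, the discriminator's logit would typically be trained with a binary cross-entropy loss on matched and mismatched audio-text pairs, mirroring the keyword/no-keyword decision described above.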
The performance of the proposed approach is evaluated across four different datasets: Google Commands V1, Qualcomm Keyword Speech, LibriPhrase-Easy, and LibriPhrase-Hard. The results show that the proposed method outperforms various baseline techniques, particularly in the challenging LibriPhrase-Hard dataset, where it achieves significant improvements in area under the curve (AUC) and equal error rate (EER) compared to the cross-modality correspondence detector (CMCD) method.
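For reference, AUC and EER in keyword-spotting verification are computed from the discriminator's match scores. The following is a small, self-contained sketch using scikit-learn; it is not the paper's evaluation code, and the scores shown are dummy values.

```python
# Computing AUC and EER from verification scores (illustrative only).
import numpy as np
from sklearn.metrics import roc_curve, roc_auc_score

def auc_and_eer(labels, scores):
    """labels: 1 = audio/text share the keyword, 0 = they do not.
    scores: discriminator outputs (higher = more likely a match)."""
    auc = roc_auc_score(labels, scores)
    fpr, tpr, _ = roc_curve(labels, scores)
    fnr = 1.0 - tpr
    # EER is the operating point where false-accept and false-reject rates meet.
    idx = np.nanargmin(np.abs(fpr - fnr))
    eer = (fpr[idx] + fnr[idx]) / 2.0
    return auc, eer

labels = np.array([1, 1, 0, 0, 1, 0])
scores = np.array([0.9, 0.7, 0.4, 0.2, 0.6, 0.55])
print(auc_and_eer(labels, scores))
```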
Additionally, the paper conducts an ablation study to investigate the efficacy of different intermediate representations from the Tacotron 2 model. The results indicate that the Bi-LSTM block output (E3) performs best and converges faster during training than the other representations. The proposed approach also demonstrates robustness in the out-of-vocabulary (OOV) scenario, again outperforming the CMCD baseline.
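As a hedged illustration of what "using the Bi-LSTM block output (E3)" means in practice, the sketch below taps a Tacotron 2-style encoder after its convolutional stack and Bi-LSTM. The attribute names (embedding, encoder.convolutions, encoder.lstm) follow common open-source Tacotron 2 implementations and are assumptions here, not the paper's code.

```python
# Hypothetical sketch: extract the Bi-LSTM block output (E3) from a
# Tacotron 2-style encoder for an enrolled keyword.
import torch

def extract_e3(tacotron2, token_ids):
    """token_ids: (B, L) character/phoneme ids for the enrolled keyword."""
    enc = tacotron2.encoder
    x = tacotron2.embedding(token_ids).transpose(1, 2)  # (B, emb, L)
    for conv in enc.convolutions:                       # convolutional stack
        x = torch.relu(conv(x))
    x = x.transpose(1, 2)                               # (B, L, emb)
    e3, _ = enc.lstm(x)                                 # E3: Bi-LSTM output
    return e3                                           # (B, L, 2 * lstm_dim)
```

In the framework sketched earlier, this E3 tensor would play the role of the text embeddings fed to the pattern extractor.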
Stats
On the challenging LibriPhrase-Hard dataset, the proposed method improved on the CMCD baseline by 8.22% in area under the curve (AUC) and 12.56% in equal error rate (EER).
Compared to the CMCD baseline, the proposed approach showed consistent gains of roughly 3% in AUC and 2.62% in EER on the Google Commands V1 and Qualcomm Keyword Speech datasets.
Quotes
"The experimental results indicate that, in the challenging LibriPhrase Hard dataset, the proposed approach outperformed the cross-modality correspondence detector (CMCD) method by a significant improvement of 8.22% in area under the curve (AUC) and 12.56% in equal error rate (EER)."
"Analyzing the results, E3 consistently outperforms others in terms of lower Equal Error Rate (EER) and higher AUC and F1-score across all datasets. This suggests it captures both acoustic and linguistic information of the enrolled keyword more effectively."