Core Concepts
Transducers with Pronunciation-aware Embeddings (PET) can improve speech recognition accuracy by incorporating shared components in the decoder embeddings for text tokens with the same or similar pronunciations.
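The paper defines its own embedding-generation scheme; the PyTorch sketch below only illustrates the general idea, assuming each text token maps to a single pronunciation ID taken from a pronunciation dictionary (the class name, the fixed token-to-pronunciation mapping, and the equal-weight mixing are illustrative assumptions, not the paper's exact formulation).

```python
import torch
import torch.nn as nn

class PronunciationAwareEmbedding(nn.Module):
    """Illustrative decoder embedding that mixes a token-specific vector with a
    component shared by all tokens that have the same pronunciation."""

    def __init__(self, num_tokens, num_prons, token_to_pron, dim, shared_weight=0.5):
        super().__init__()
        # Conventional per-token embedding table.
        self.token_emb = nn.Embedding(num_tokens, dim)
        # One embedding per distinct pronunciation, shared by homophones.
        self.pron_emb = nn.Embedding(num_prons, dim)
        # Fixed token -> pronunciation mapping built from a pronunciation dictionary.
        self.register_buffer("token_to_pron", torch.as_tensor(token_to_pron))
        self.shared_weight = shared_weight

    def forward(self, token_ids):
        pron_ids = self.token_to_pron[token_ids]
        # Tokens with the same pronunciation receive the same shared component.
        return (1.0 - self.shared_weight) * self.token_emb(token_ids) \
            + self.shared_weight * self.pron_emb(pron_ids)
```

With the AISHELL-2 figures reported below, such a layer would be instantiated as roughly `PronunciationAwareEmbedding(num_tokens=5000, num_prons=1149, ...)`, so homophonic characters literally share part of their embedding.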
Abstract
The paper proposes Transducers with Pronunciation-aware Embeddings (PET), an extension to the Transducer model that can encode expert knowledge of a pronunciation dictionary in the model's embedding parameters.
The key highlights are:
- PET models use a novel embedding generation scheme where embeddings for text tokens can have shared components based on the similarity of their pronunciations.
- The paper identifies an "error chain reaction" phenomenon in Transducer models: rather than being evenly distributed across an utterance, recognition errors tend to cluster, with one error making subsequent errors more likely (a sketch of one way to measure this clustering follows this list).
- PET models are shown to consistently improve speech recognition accuracy for Mandarin Chinese and Korean datasets compared to conventional Transducer models, primarily by mitigating the error chain reaction issue.
- The authors will open-source their implementation with the NeMo toolkit.
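The paper performs its own error analysis; the sketch below is just one hedged way to check for chain reactions, assuming character-level references and hypotheses and a simple alignment from Python's difflib (the metric and function name are illustrative, not taken from the paper).

```python
from difflib import SequenceMatcher

def chain_reaction_stats(refs, hyps):
    """Compare P(error at position i | error at i-1) with
    P(error at i | correct at i-1), pooled over utterance pairs.
    A large gap between the two suggests errors cluster into chains."""
    after_err = [0, 0]  # [errors, total] following an erroneous reference position
    after_ok = [0, 0]   # [errors, total] following a correctly recognised position
    for ref, hyp in zip(refs, hyps):
        # Mark each reference character as correctly recognised or not.
        correct = [False] * len(ref)
        for block in SequenceMatcher(None, ref, hyp).get_matching_blocks():
            for i in range(block.a, block.a + block.size):
                correct[i] = True
        for prev, cur in zip(correct, correct[1:]):
            bucket = after_err if not prev else after_ok
            bucket[1] += 1
            bucket[0] += int(not cur)
    p_err_after_err = after_err[0] / max(after_err[1], 1)
    p_err_after_ok = after_ok[0] / max(after_ok[1], 1)
    return p_err_after_err, p_err_after_ok
```

Under the paper's finding, a conventional Transducer would show a markedly higher error rate after an error than after a correct character, while a PET model would narrow that gap.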
Stats
In the AISHELL-2 Mandarin Chinese dataset, the 5000 characters in the vocabulary share only 1149 distinct pronunciations, so many characters are homophones (a sketch of how such pronunciation groups can be counted follows these stats).
On the AISHELL-2 iOS-test set, the best PET model achieves a 2.7% relative reduction in character error rate (CER) compared to the baseline.
On the THCHS Mandarin Chinese test set, the best PET model achieves a 7.1% relative reduction in CER.
On the Zeroth-Korean test set, the best PET model achieves a CER of 1.36%, which is the best reported result on this dataset.
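The homophone statistic can be reproduced in spirit with any grapheme-to-pronunciation dictionary; the sketch below assumes the third-party pypinyin package as that dictionary, which may differ from the lexicon used in the paper.

```python
from collections import defaultdict

from pypinyin import Style, pinyin  # assumed third-party pinyin dictionary

def pronunciation_groups(characters):
    """Group characters by their toned pinyin pronunciation."""
    groups = defaultdict(set)
    for ch in characters:
        # Take the first reading; polyphonic characters actually have several.
        pron = pinyin(ch, style=Style.TONE3)[0][0]
        groups[pron].add(ch)
    return groups

# len(pronunciation_groups(vocab_characters)) gives the number of distinct
# pronunciations, and any group with more than one character is a homophone set.
```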
Quotes
"Unlike conventional Transducers where the decoder embeddings for different tokens are trained independently, the PET model's decoder embedding incorporates shared components for text tokens with the same or similar pronunciations."
"We uncover a phenomenon that we call error chain reactions. Instead of recognition errors being evenly spread throughout an utterance, they tend to group together, with subsequent errors often following earlier ones."
"PET models effectively mitigate this issue by substantially reducing the likelihood of the model generating additional errors following a prior one."