Core Concepts
This work aims to build robust speaker-independent acoustic models for continuous speech recognition of Swedish using the SpeechDat database.
Abstract
The main points of the paper are:
The paper describes the development of acoustic models for automatic continuous speech recognition of Swedish using hidden Markov models (HMMs) and the SpeechDat database.
The acoustic models were built at the phonetic level so that they can serve general speech recognition applications, although a simplified digit and natural-number recognition task was used to evaluate them.
Different types of phone models were tested, including context-independent models and two variations of context-dependent models (within-word and cross-word context expansion).
Extensive experiments were conducted to tune the system parameters, including the number of Gaussian mixture components and the use of retroflex allophones in the lexicon.
The models were evaluated on both the development set (50 speakers) and the evaluation set (200 speakers), with the best overall accuracy of 88.6% achieved using within-word context-expanded models with 8 Gaussian mixture components.
Per-speaker analysis showed that the models performed well across different speaker characteristics, with some exceptions for speakers from certain dialect regions.
The flexibility of the models was demonstrated by testing them on the Waxholm database, whose characteristics differ from those of the SpeechDat data used for training.
Further improvements were suggested, such as increasing the number of Gaussian mixture components and exploring strategies to handle stationary noise in the telephone recordings.
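The difference between within-word and cross-word context expansion mentioned above can be illustrated with a minimal sketch. It assumes the common "left-phone+right" triphone notation. The function names and the phone symbols for the two example words are hypothetical and are not taken from the paper's lexicon.

```python
from typing import List


def within_word(words: List[List[str]]) -> List[str]:
    """Expand each word on its own, so context never crosses a word boundary."""
    out = []
    for phones in words:
        for i, p in enumerate(phones):
            # Left context exists only if there is a preceding phone in this word.
            left = phones[i - 1] + "-" if i > 0 else ""
            # Right context exists only if there is a following phone in this word.
            right = "+" + phones[i + 1] if i < len(phones) - 1 else ""
            out.append(left + p + right)
    return out


def cross_word(words: List[List[str]]) -> List[str]:
    """Expand over the whole utterance, so context spans word boundaries."""
    flat = [p for phones in words for p in phones]
    return within_word([flat])


# Illustrative two-word utterance (phone symbols are placeholders).
utterance = [["e", "t"], ["t", "v", "o:"]]
print(within_word(utterance))  # ['e+t', 'e-t', 't+v', 't-v+o:', 'v-o:']
print(cross_word(utterance))   # ['e+t', 'e-t+t', 't-t+v', 't-v+o:', 'v-o:']
```

In the within-word case the phones at word edges fall back to biphones, while the cross-word case produces more distinct context-dependent units for the same utterance.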
Stats
The SpeechDat database used for training and testing contains recordings from 1000 Swedish speakers.
The evaluation set consists of 200 speakers, while the development set has 50 speakers.
The recognition task includes digits, natural numbers, and other short utterances.
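The summary does not say how the accuracy figures are defined. For a word-sequence task such as this one, a common convention is word accuracy, (N - S - D - I) / N, where N is the number of reference words and S, D and I are the substitutions, deletions and insertions in the best alignment. The sketch below assumes that definition with unit edit costs. The function names are hypothetical, and this is not necessarily the scoring used in the paper.

```python
from typing import List


def edit_distance(ref: List[str], hyp: List[str]) -> int:
    """Minimum number of substitutions, deletions and insertions (unit costs)."""
    prev = list(range(len(hyp) + 1))
    for i, r in enumerate(ref, start=1):
        cur = [i]
        for j, h in enumerate(hyp, start=1):
            cur.append(min(
                prev[j] + 1,             # deletion of a reference word
                cur[j - 1] + 1,          # insertion of a hypothesis word
                prev[j - 1] + (r != h),  # substitution, or match at no cost
            ))
        prev = cur
    return prev[-1]


def word_accuracy(ref: List[str], hyp: List[str]) -> float:
    """Word accuracy in percent: 100 * (N - S - D - I) / N."""
    if not ref:
        raise ValueError("reference must contain at least one word")
    return 100.0 * (len(ref) - edit_distance(ref, hyp)) / len(ref)


# One deleted word and one inserted word out of four reference words.
reference = ["ett", "tre", "fyra", "fem"]
hypothesis = ["ett", "fyra", "fem", "sex"]
print(word_accuracy(reference, hypothesis))  # 50.0
```

Because insertions are penalised, this measure can fall below the plain percentage of correctly recognised words and can even become negative.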
Quotes
"The difficulty of the problem increases if we try to build systems for a large set of speakers and for a generic task (large vocabulary)."
"Results are compared to previous similar studies showing a remarkable improvement."
"Knowing m and a can be useful to predict results when new speakers are added to the evaluation set, or to evaluate new developments in the systems."