Core Concepts
Recent advances in data-driven speech recognition, notably Transformer architectures and unprecedented volumes of training data, have significantly improved child speech recognition, making it more viable for real-world human-robot interaction.
Abstract
The authors revisit a 2017 study on child speech recognition and evaluate the performance of state-of-the-art Automatic Speech Recognition (ASR) engines, including OpenAI's Whisper, Microsoft Azure Speech-to-Text, and Google Cloud Speech-to-Text. The results show a dramatic improvement in recognition performance compared to the 2017 study.
The key findings are:
Transcription accuracy: The best-performing model, Whisper large-v3, achieves a relaxed accuracy of 60.3%, meaning that the majority of child utterances are transcribed correctly or with only minor grammatical differences. This is a significant improvement over the 2017 results, where the best model recognized only around 20% of utterances.
Responsiveness: The locally hosted Whisper models, especially the smaller versions, can provide sub-second transcription times, making them suitable for real-time spoken interaction. In contrast, the cloud-based solutions have higher latency due to network overhead.
Microphone impact: Using an external microphone, rather than the one embedded in the robot, leads to significantly better recognition performance, regardless of the microphone quality. The robot's own noise has a stronger impact than the microphone choice.
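The responsiveness finding can be checked on one's own hardware by timing each transcription call. The following is a minimal sketch, not the authors' benchmark: `measure_latency` and `fake_transcribe` are hypothetical names introduced here, and the stand-in transcriber simply sleeps so the example runs without a model or audio files. In practice `transcribe` would wrap a locally hosted model or a cloud API call.

```python
import statistics
import time
from typing import Callable, Dict, Iterable


def measure_latency(transcribe: Callable[[str], str],
                    audio_paths: Iterable[str]) -> Dict[str, object]:
    """Time each call to `transcribe` and summarise wall-clock latency."""
    latencies = []
    for path in audio_paths:
        start = time.perf_counter()
        transcribe(path)  # the returned text is ignored; only timing matters here
        latencies.append(time.perf_counter() - start)
    if not latencies:
        raise ValueError("audio_paths must contain at least one item")
    return {
        "median_s": statistics.median(latencies),
        "max_s": max(latencies),
        # True only if every single utterance came back in under one second
        "sub_second": max(latencies) < 1.0,
    }


def fake_transcribe(path: str) -> str:
    """Hypothetical stand-in for a real ASR call; it only simulates work."""
    time.sleep(0.01)
    return "the dog is in front of the horse"


report = measure_latency(fake_transcribe, ["a.wav", "b.wav", "c.wav"])
print(report)
```

For a cloud service, the measured time would include network overhead, which is the effect the summary attributes the higher latency to.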
The authors conclude that although adult-like recognition accuracy has not yet been achieved, current state-of-the-art ASR models provide a usable level of performance for child-robot interaction, especially when combined with other dialogue management components. They also give recommendations for selecting an ASR model based on the trade-off between accuracy and responsiveness.
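One way to make the accuracy versus responsiveness trade-off concrete is to keep only the models that no other model beats on both criteria (a Pareto front). This is a sketch of that general technique, not the authors' selection procedure; `ModelResult`, `pareto_front`, the model names, and all numbers below are made-up placeholders, not results from the study. Both values are treated as "lower is better".

```python
from typing import List, NamedTuple


class ModelResult(NamedTuple):
    name: str
    latency_s: float   # transcription time in seconds, lower is better
    error_rate: float  # any error-style score, lower is better


def pareto_front(results: List[ModelResult]) -> List[ModelResult]:
    """Return the results that are not dominated by any other result."""
    front = []
    for r in results:
        dominated = any(
            o.latency_s <= r.latency_s
            and o.error_rate <= r.error_rate
            and (o.latency_s < r.latency_s or o.error_rate < r.error_rate)
            for o in results
        )
        if not dominated:
            front.append(r)
    return front


# Placeholder values for illustration only (NOT measurements from the paper).
candidates = [
    ModelResult("model_a", 0.4, 0.50),
    ModelResult("model_b", 0.9, 0.35),
    ModelResult("model_c", 1.5, 0.40),  # slower and less accurate than model_b
    ModelResult("model_d", 2.5, 0.30),
]

front = pareto_front(candidates)
print([r.name for r in front])  # ['model_a', 'model_b', 'model_d']
```

The remaining candidates each represent a different compromise; which one to pick depends on how much delay the interaction can tolerate.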
Stats
The dog is in front of the horse.
the dog is the front of the horse.
the song in the front of the horse.
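The three sentences above are not labelled in this summary. Assuming the first is the reference utterance and the other two are ASR transcriptions, the gap between them can be quantified with a word-level edit distance, which underlies the commonly used word error rate (WER). The summary does not state which metric the authors used, so this is an illustrative sketch only; `normalize`, `word_edit_distance`, and `wer` are names introduced here.

```python
import string
from typing import List


def normalize(text: str) -> List[str]:
    """Lowercase, drop ASCII punctuation, and split into words."""
    cleaned = text.lower().translate(str.maketrans("", "", string.punctuation))
    return cleaned.split()


def word_edit_distance(ref: List[str], hyp: List[str]) -> int:
    """Levenshtein distance over words; each edit costs 1."""
    prev = list(range(len(hyp) + 1))
    for i, r in enumerate(ref, start=1):
        curr = [i]
        for j, h in enumerate(hyp, start=1):
            curr.append(min(
                prev[j] + 1,              # deletion of a reference word
                curr[j - 1] + 1,          # insertion of a hypothesis word
                prev[j - 1] + (r != h),   # match (0) or substitution (1)
            ))
        prev = curr
    return prev[-1]


def wer(reference: str, hypothesis: str) -> float:
    """Word error rate: edit distance divided by reference length."""
    ref, hyp = normalize(reference), normalize(hypothesis)
    if not ref:
        raise ValueError("reference must contain at least one word")
    return word_edit_distance(ref, hyp) / len(ref)


reference = "The dog is in front of the horse."
print(wer(reference, "the dog is the front of the horse."))   # 0.125
print(wer(reference, "the song in the front of the horse."))  # 0.375
```

Under this assumption the second sentence differs from the reference by one word out of eight, and the third by three.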
Quotes
"Whisper is already rather usable, as small mistakes that do not count as accurate for the relaxed accuracy criteria, could still be handled by dialogue management software."
"To choose which model to use, both responsiveness and performance should be taken into account. Lower results are preferred for both, so models in the lower left corner of the scatter plot are ideal."