Core Concepts
The paper proposes an audio-textual diffusion model that generates high-quality UTI data with clear tongue contours, which are crucial for linguistic analysis and clinical assessment.
Abstract
An audio-textual diffusion model is introduced to convert speech signals into ultrasound tongue imaging (UTI) data. By integrating acoustic and textual information, the proposed model generates high-quality UTI data with a clear tongue contour. Experimental results demonstrate significant improvements over traditional methods in terms of quality metrics such as RMSE, LPIPS, PSNR, and FID.
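Of the four metrics named above, RMSE and PSNR are simple pixel-level quantities, while LPIPS and FID require pretrained networks. A minimal sketch of the two pixel-level metrics on toy ultrasound-like frames (shapes and values are illustrative, not from the paper):

```python
import numpy as np

def rmse(ref, gen):
    """Root-mean-square error between reference and generated frames (lower is better)."""
    return float(np.sqrt(np.mean((ref - gen) ** 2)))

def psnr(ref, gen, max_val=1.0):
    """Peak signal-to-noise ratio in dB (higher means closer to the reference)."""
    mse = np.mean((ref - gen) ** 2)
    return float(10 * np.log10(max_val ** 2 / mse))

# Hypothetical 64x64 frames in [0, 1]: a reference and a noisy "generated" version.
rng = np.random.default_rng(0)
ref = rng.random((64, 64))
gen = np.clip(ref + rng.normal(0, 0.05, ref.shape), 0.0, 1.0)
print(rmse(ref, gen), psnr(ref, gen))
```

With unit dynamic range the two are directly related: PSNR = -20 log10(RMSE), so an RMSE around 0.05 corresponds to roughly 26 dB.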
The study addresses the limitations of existing Acoustic-to-Articulatory Inversion (AAI) methods by leveraging personalized acoustic information and universal textual inputs. The diffusion model captures temporal dependencies across consecutive pronunciations to enhance the quality of generated UTI data. The fusion of acoustic and textual embeddings through a cross-attention mechanism improves the coherence and naturalness of the generated UTI data.
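The cross-attention fusion described above can be sketched as single-head attention in which acoustic frame embeddings act as queries over textual token embeddings; all dimensions, weight matrices, and shapes here are hypothetical stand-ins, not the paper's architecture:

```python
import numpy as np

def softmax(x, axis=-1):
    e = np.exp(x - x.max(axis=axis, keepdims=True))
    return e / e.sum(axis=axis, keepdims=True)

def cross_attention(acoustic, textual, Wq, Wk, Wv):
    """Single-head cross-attention: each acoustic frame attends over text tokens.

    acoustic: (T_a, d) per-frame acoustic embeddings (queries)
    textual:  (T_t, d) per-token textual embeddings (keys and values)
    Returns fused embeddings of shape (T_a, d_k).
    """
    Q = acoustic @ Wq                          # (T_a, d_k)
    K = textual @ Wk                           # (T_t, d_k)
    V = textual @ Wv                           # (T_t, d_k)
    scores = Q @ K.T / np.sqrt(Q.shape[-1])    # (T_a, T_t) scaled dot products
    weights = softmax(scores, axis=-1)         # each frame's distribution over tokens
    return weights @ V                         # text-conditioned acoustic features

# Hypothetical sizes: 50 acoustic frames, 12 text tokens, d=16, d_k=8.
rng = np.random.default_rng(1)
d, dk, T_a, T_t = 16, 8, 50, 12
Wq, Wk, Wv = (rng.normal(size=(d, dk)) for _ in range(3))
fused = cross_attention(rng.normal(size=(T_a, d)), rng.normal(size=(T_t, d)), Wq, Wk, Wv)
print(fused.shape)  # (50, 8)
```

The fused output keeps the acoustic time axis, which is why conditioning the per-frame denoising on it can improve coherence across consecutive pronunciations.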
By converting audio into UTI data with the diffusion model, the study demonstrates advances relevant to speech technology applications such as automatic speech recognition, speech therapy, and speech assessment. The proposed system outperforms DNN-based AAI systems on all evaluation metrics, highlighting its potential for clinical assessment of tongue function.
Stats
The proposed diffusion model generated high-quality UTI data with clear tongue contours, outperforming DNN-based baselines.
The diffusion system achieved a relative LPIPS improvement of 67.95% over the baseline.
FID decreased from 256.80 to 22.02 with the proposed diffusion AAI system.
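Relative-improvement figures like these follow from simple arithmetic; applying the standard formula to the reported FID drop (a sanity check on the numbers above, not a figure from the paper):

```python
def relative_improvement(baseline, proposed):
    """Relative reduction of an error metric (lower is better), as a percentage."""
    return 100.0 * (baseline - proposed) / baseline

# FID values reported above for the baseline and proposed AAI systems.
print(round(relative_improvement(256.80, 22.02), 2))  # 91.43
```

So the FID drop from 256.80 to 22.02 corresponds to roughly a 91% relative improvement.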
The training set consists of 40 speakers with a total dataset duration of 6.85 hours.
The ASR model achieved a character error rate (CER) of 0.02% on the Mandarin speech-ultrasound dataset.
Quotes
"The proposed diffusion AAI system consistently outperformed DNN-based AAI baselines in all metrics."
"The introduction of additional textual information significantly enhanced the quality of generated UTI data."
"The fusion of acoustic and textual embeddings improved coherence and naturalness in generated UTI data."