The author proposes an audio-textual diffusion model that generates high-quality ultrasound tongue imaging (UTI) data with clear tongue contours, which are crucial for linguistic analysis and clinical assessment.