FeruzaSpeech is a high-quality Uzbek speech corpus that provides 60 hours of recordings from a single native female speaker, with transcripts in both Cyrillic and Latin alphabets, to support the development of speech recognition and synthesis technologies for the Uzbek language.
A data-driven model is proposed that generates a listener's continuous head-motion responses in real time, conditioned on the speaker's speech.
A novel two-stage framework cascades target speaker extraction and speech emotion recognition to mitigate the impact of human speech noise on emotion recognition performance.
Individuals with latent post-stroke aphasia, despite performing within normal limits on clinical tests, exhibit subtle differences in prosodic features of speech production compared to neurotypical controls, which can be leveraged to build reliable automated classification tools.
VoxHakka is a freely available, high-quality multi-speaker text-to-speech system designed to synthesize speech in all six major dialects of Taiwanese Hakka, a critically under-resourced language.
Integrating signal processing cues with deep learning techniques can produce accurate phone alignments, leading to better duration modeling and higher-quality text-to-speech synthesis for Indian languages.
An end-to-end system named Pretrain-based Dual-filter Dysarthria Wake-up word Spotting (PD-DWS) is proposed to address the challenge of low-resource dysarthric wake-up word spotting, achieving state-of-the-art performance.
IndicVoices-R is the largest multilingual Indian text-to-speech dataset, comprising 1,704 hours of high-quality speech from 10,496 speakers across 22 Indian languages, enabling the development of robust and versatile TTS models.
The PB-LRDWWS system combines a dysarthric speech content feature extractor with a prototype-based classification method, achieving strong performance in the SLT 2024 Low-Resource Dysarthria Wake-Up Word Spotting Challenge.
A novel approach to text-independent phone-to-audio alignment combines self-supervised learning, representation learning, and knowledge transfer; it outperforms the state of the art and adapts to diverse English accents and other languages.