This project aims to improve the naturalness and expressiveness of Text-to-Speech (TTS) systems by developing a machine learning model that manipulates the prosodic parameters (pitch, duration, and energy) of TTS-generated speech to make it more closely resemble human speech.
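A minimal, hypothetical sketch of the kind of prosody manipulation described: the project's model would predict the pitch, duration, and energy adjustments, whereas here the scale factors are placeholder constants and off-the-shelf librosa signal processing is used purely for illustration.

```python
# Illustrative only: applies fixed prosody adjustments to a TTS waveform.
import librosa
import soundfile as sf

def apply_prosody(wav_path, out_path, pitch_steps=1.0, rate=0.95, energy_gain=1.1):
    """Shift pitch (semitones), stretch duration, and scale energy of a waveform."""
    y, sr = librosa.load(wav_path, sr=None)
    y = librosa.effects.pitch_shift(y, sr=sr, n_steps=pitch_steps)  # pitch
    y = librosa.effects.time_stretch(y, rate=rate)                  # duration
    y = y * energy_gain                                             # energy
    sf.write(out_path, y, sr)

# apply_prosody("tts_output.wav", "prosody_adjusted.wav")
```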
Wave-U-Mamba is an efficient and effective end-to-end framework for speech super-resolution that directly generates high-resolution speech waveforms from low-resolution inputs, outperforming existing state-of-the-art models.
A modular pipeline for single-channel meeting transcription that combines continuous speech separation, automatic speech recognition, and transcription-supported diarization to achieve state-of-the-art performance.
A novel method to personalize a lightweight dual-stage speech enhancement model, DeepFilterNet2, by integrating speaker embeddings into the model architecture, achieving significant performance improvements while adding minimal computational overhead.
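A hedged sketch of one common way to inject a speaker embedding into an enhancement network, feature-wise affine modulation (FiLM); the actual integration used with DeepFilterNet2 in the paper may differ, and all dimensions below are illustrative assumptions.

```python
# Toy FiLM conditioning: a speaker embedding scales and shifts intermediate features.
import torch
import torch.nn as nn

class SpeakerFiLM(nn.Module):
    def __init__(self, emb_dim=192, feat_channels=64):
        super().__init__()
        self.proj = nn.Linear(emb_dim, 2 * feat_channels)  # predicts scale and shift

    def forward(self, feats, spk_emb):
        # feats: (B, C, T, F) intermediate features; spk_emb: (B, emb_dim)
        scale, shift = self.proj(spk_emb).chunk(2, dim=-1)
        scale = scale[:, :, None, None]
        shift = shift[:, :, None, None]
        return feats * (1 + scale) + shift

film = SpeakerFiLM()
feats = torch.randn(2, 64, 100, 32)   # batch of intermediate features
spk_emb = torch.randn(2, 192)          # e.g. pre-computed speaker embeddings
out = film(feats, spk_emb)             # same shape as feats
```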
The authors present their systems developed for the Interspeech 2024 Speech Processing Using Discrete Speech Unit Challenge, including techniques for text-to-speech, singing voice synthesis, and automatic speech recognition using discrete speech tokens. Their approaches demonstrate the potential of discrete speech representations to achieve high-quality and low-bitrate speech processing.
This paper presents an end-to-end model that combines a speech enhancement module (ConVoiFilter) and an automatic speech recognition (ASR) module to improve speech recognition performance in noisy, crowded environments. The model utilizes a single-channel speech enhancement approach to isolate the target speaker's voice from background noise and then feeds the enhanced audio into the ASR module.
A novel framework that leverages intermediate representations extracted from a pre-trained text-to-speech (TTS) model to enhance the performance of open vocabulary keyword spotting.
The proposed Profile-Error-Tolerant Target-Speaker Voice Activity Detection (PET-TSVAD) model is robust to speaker profile errors introduced by first-pass diarization, outperforming existing TS-VAD models on both the VoxConverse and DIHARD-I datasets.
CLaM-TTS employs probabilistic residual vector quantization to achieve superior compression in token length and enable a language model to generate multiple tokens at once, thereby enhancing the efficiency of zero-shot text-to-speech synthesis.
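To make the quantization idea concrete, here is a toy sketch of plain (non-probabilistic) residual vector quantization, the mechanism CLaM-TTS builds on: each stage quantizes the residual left by the previous stage, so a short stack of codes represents one latent vector. Codebook sizes and dimensions are illustrative.

```python
# Toy residual vector quantization encoder.
import numpy as np

def rvq_encode(x, codebooks):
    """x: (dim,) latent vector; codebooks: list of (K, dim) arrays."""
    residual, codes = x.copy(), []
    for cb in codebooks:
        idx = int(np.argmin(((cb - residual) ** 2).sum(axis=1)))  # nearest code entry
        codes.append(idx)
        residual = residual - cb[idx]                              # quantize what remains
    return codes

rng = np.random.default_rng(0)
codebooks = [rng.normal(size=(256, 16)) for _ in range(4)]  # 4 stages, 256 codes each
codes = rvq_encode(rng.normal(size=16), codebooks)           # e.g. [17, 203, 45, 88]
```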
Transfer learning from the state-of-the-art Whisper automatic speech recognition model can effectively predict the distribution of lexical responses that human listeners report for noisy speech stimuli, outperforming baseline methods.
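A hedged sketch of one plausible transfer-learning setup, not necessarily the paper's exact method: pool Whisper encoder features for a noisy stimulus and train a small head to match the listeners' response distribution with a KL-divergence loss. The model checkpoint, pooling, and 500-word response vocabulary are illustrative assumptions.

```python
# Illustrative: Whisper encoder features + linear head over candidate responses.
import torch
import torch.nn.functional as F
from transformers import WhisperFeatureExtractor, WhisperModel

fe = WhisperFeatureExtractor.from_pretrained("openai/whisper-base")
whisper = WhisperModel.from_pretrained("openai/whisper-base")
encoder = whisper.get_encoder()
head = torch.nn.Linear(whisper.config.d_model, 500)  # hypothetical response vocabulary

def predict_response_distribution(waveform_16k):
    feats = fe(waveform_16k, sampling_rate=16000, return_tensors="pt").input_features
    with torch.no_grad():
        enc = encoder(feats).last_hidden_state.mean(dim=1)  # mean-pool over time
    return F.log_softmax(head(enc), dim=-1)                  # log-probs over responses

# Training would minimize F.kl_div(predicted_log_probs, listener_distribution,
# reduction="batchmean") against the empirical human response distribution.
```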