This paper presents a novel cross-domain audio deepfake detection (CD-ADD) dataset comprising over 300 hours of speech data generated by five advanced zero-shot text-to-speech (TTS) models. The dataset is designed to simulate real-world scenarios and evaluate the generalization capabilities of deepfake detection models.
The VoicePrivacy 2024 Challenge aims to develop voice anonymization systems that conceal speaker identity while preserving linguistic and emotional content in speech data.
Self-supervised pretraining enhances noise-robustness in keyword spotting models, outperforming supervised methods.
A direct Audio-Visual Speech to Audio-Visual Speech Translation (AV2AV) framework is proposed for improved multilingual communication.
Generative pre-training with flow matching in speech technology shows promising results for various downstream tasks.
VOICECRAFT achieves state-of-the-art performance on speech editing and zero-shot TTS with innovative token rearrangement.
Integrating pre-trained AV-HuBERT with a Mask-And-Recover strategy enhances target speech extraction performance.
Crowdsourced multilingual speech intelligibility testing offers a cost-efficient and scalable approach to assess speech quality across languages.
XLAVS-R is a cross-lingual audio-visual speech representation model that enables noise-robust speech recognition and translation in over 100 languages.