This paper presents a novel cross-domain audio deepfake detection (CD-ADD) dataset comprising over 300 hours of speech data generated by five advanced zero-shot text-to-speech (TTS) models. The dataset is designed to simulate real-world scenarios and evaluate the generalization capabilities of deepfake detection models.
The VoicePrivacy 2024 Challenge aims to develop voice anonymization systems that conceal speaker identity while preserving linguistic and emotional content in speech data.
Self-supervised pretraining enhances noise-robustness in keyword spotting models, outperforming supervised methods.
A direct Audio-Visual Speech to Audio-Visual Speech Translation (AV2AV) framework is proposed for improved multilingual communication.
Generative pre-training with flow matching in speech technology shows promising results for various downstream tasks.
VOICECRAFT achieves state-of-the-art performance on speech editing and zero-shot TTS with innovative token rearrangement.
Integrating pre-trained AV-HuBERT with a Mask-And-Recover strategy enhances target speech extraction performance.
Crowdsourced multilingual speech intelligibility testing offers a cost-efficient and scalable approach to assess speech quality across languages.
XLAVS-R is a cross-lingual audio-visual speech representation model that enables noise-robust speech recognition and translation in over 100 languages.