Key Concepts
This paper presents a novel cross-domain audio deepfake detection (CD-ADD) dataset comprising over 300 hours of speech data generated by five advanced zero-shot text-to-speech (TTS) models. The dataset is designed to simulate real-world scenarios and evaluate the generalization capabilities of deepfake detection models.
Summary
The paper addresses the urgent need for up-to-date resources to combat the evolving risks of zero-shot TTS technologies. The key highlights and insights are:
- Construction of the CD-ADD dataset:
  - The dataset includes speech data generated by five cutting-edge zero-shot TTS models, including decoder-only and encoder-decoder architectures.
  - Quality control measures are implemented during dataset construction to ensure intelligible synthetic speech.
  - The dataset introduces two tasks: in-model ADD and cross-model ADD, the latter being more challenging and representative of real-world scenarios.
- Evaluation of attack resilience:
  - Nine different attacks, including traditional and DNN-based methods, are tested on the ADD models.
  - Attack-augmented training improves the models' adaptability, with certain attacks even enhancing the models' generalizability.
  - Neural codec compression, especially at lower bit rates, poses a significant threat to the detection accuracy.
- Few-shot learning performance:
  - The Wav2Vec2-large and Whisper-medium models demonstrate superior cross-model ADD performance compared to the Wav2Vec2-base model.
  - With just one minute of target-domain data, the ADD models can significantly improve their performance, highlighting their fast adaptation capability.
  - However, the effectiveness of few-shot fine-tuning is reduced when the audio is compressed using neural codecs.
The study highlights the challenges of cross-domain ADD, the importance of attack-augmented training, and the potential of few-shot learning, providing valuable insights for future research in this domain.
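The difference between the in-model and cross-model tasks can be illustrated with a small data-partitioning sketch. It assumes that in-model ADD means testing on speech from TTS models seen during training, and that cross-model ADD means testing on speech from a TTS model held out of training entirely. The function name `make_task_splits` and the record layout are hypothetical and are not taken from the paper.

```python
def make_task_splits(records, held_out_model):
    """Partition utterances into train, in-model test, and cross-model test sets.

    records: iterable of (utterance_id, tts_model, partition) tuples, where
             partition is "train" or "test" (hypothetical layout).
    held_out_model: name of the TTS model that must never appear in training.
    """
    train, in_model_test, cross_model_test = [], [], []
    for utt_id, tts_model, partition in records:
        if tts_model == held_out_model:
            # Unseen generator: only its test partition is used, for cross-model evaluation.
            if partition == "test":
                cross_model_test.append(utt_id)
        elif partition == "train":
            train.append(utt_id)
        else:
            # Test data from a generator that also contributed training data.
            in_model_test.append(utt_id)
    return train, in_model_test, cross_model_test


# Toy records; the utterance IDs are made up for illustration.
records = [
    ("u1", "VALL-E", "train"),
    ("u2", "VALL-E", "test"),
    ("u3", "OpenVoice", "train"),
    ("u4", "OpenVoice", "test"),
]
train, in_model_test, cross_model_test = make_task_splits(records, "OpenVoice")
```

Under these assumptions, a detector would be fitted on `train` and scored separately on the two test lists, so any gap between the two scores reflects how well it generalizes to an unseen generator.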
Statistics
The CD-ADD dataset comprises over 300 hours of training data and 50 hours of test data.
The average utterance length exceeds 8 seconds, which is longer than in traditional ASR datasets.
The VALL-E model has the fewest utterances due to its relative instability.
Among the five zero-shot TTS models, OpenVoice has the lowest word error rate (WER) and also the lowest speaker similarity score.
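For readers unfamiliar with the metric, WER is conventionally defined as the word-level edit distance (substitutions, deletions, and insertions) between a reference text and a transcript, divided by the number of reference words. The sketch below implements that standard definition. How the paper obtains transcripts of the synthetic speech is not stated in this summary, so the function simply takes two strings, and the name `word_error_rate` is chosen here for illustration.

```python
def word_error_rate(reference: str, hypothesis: str) -> float:
    """Word-level Levenshtein distance divided by the reference length."""
    ref = reference.split()
    hyp = hypothesis.split()
    if not ref:
        raise ValueError("reference must contain at least one word")

    # prev[j] holds the edit distance between the reference prefix processed
    # so far and the first j words of the hypothesis.
    prev = list(range(len(hyp) + 1))
    for i, ref_word in enumerate(ref, start=1):
        curr = [i]  # aligning i reference words with an empty hypothesis costs i deletions
        for j, hyp_word in enumerate(hyp, start=1):
            substitution = prev[j - 1] + (0 if ref_word == hyp_word else 1)
            deletion = prev[j] + 1
            insertion = curr[j - 1] + 1
            curr.append(min(substitution, deletion, insertion))
        prev = curr
    return prev[-1] / len(ref)


# One substitution ("sat" -> "sit") and one deletion ("the") over six reference words.
wer = word_error_rate("the cat sat on the mat", "the cat sit on mat")
```

A lower value means the synthetic speech is transcribed closer to the intended text, which is why WER serves as a proxy for intelligibility.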
Quotes
"Audio deepfakes, created by text-to-speech (TTS) and voice conversion (VC) models, pose severe risks to social stability by spreading misinformation, violating privacy, and undermining trust."
"To demonstrate generalization capabilities, several studies have implemented cross-dataset evaluation (Müller et al., 2022; Ba et al., 2023)."
"Nonetheless, neural codec compressors greatly affect the detection accuracy, necessitating further research."