The DFADD dataset was created to address a limitation of existing audio anti-spoofing datasets, which focus primarily on traditional TTS and voice conversion methods. It contains spoofed audio generated by five diverse, mainstream open-source TTS models based on diffusion and flow matching: Grad-TTS, NaturalSpeech 2, Style-TTS 2, Matcha-TTS, and PFlow-TTS.
The authors evaluated how well state-of-the-art anti-spoofing models detect speech synthesized by these advanced TTS systems. Models trained on the ASVspoof dataset struggled to detect spoofs from diffusion and flow-matching based TTS models, with equal error rates (EERs) typically above 30%. In contrast, anti-spoofing models trained on the DFADD dataset performed significantly better, with an average EER reduction of over 47% compared to the baseline.
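Since the comparison rests on the EER, a minimal sketch of how that metric can be computed may help. The EER is the operating point at which the false acceptance rate (spoofs accepted as bona fide) equals the false rejection rate (bona fide audio rejected). The function name `compute_eer`, the convention that higher scores mean "more likely bona fide", and the toy scores are assumptions made for illustration; this is not the authors' evaluation code.

```python
import numpy as np


def compute_eer(bonafide_scores, spoof_scores):
    """Approximate the equal error rate by sweeping thresholds.

    Assumes higher scores mean "more likely bona fide" (an illustrative
    convention, not taken from the DFADD paper).
    """
    bonafide = np.asarray(bonafide_scores, dtype=float)
    spoof = np.asarray(spoof_scores, dtype=float)

    # Candidate thresholds: every observed score, plus one above the maximum
    # so that the false acceptance rate can reach zero.
    thresholds = np.unique(np.concatenate([bonafide, spoof]))
    thresholds = np.append(thresholds, thresholds[-1] + 1.0)

    # False acceptance rate: spoofed samples scored at or above the threshold.
    far = np.array([(spoof >= t).mean() for t in thresholds])
    # False rejection rate: bona fide samples scored below the threshold.
    frr = np.array([(bonafide < t).mean() for t in thresholds])

    # Take the threshold where the two rates are closest and average them.
    idx = np.argmin(np.abs(far - frr))
    return float((far[idx] + frr[idx]) / 2.0)


# Toy scores: one spoof (0.7) outscores one bona fide sample (0.4 and 0.6 sit low).
eer = compute_eer([0.9, 0.6, 0.4, 0.8], [0.1, 0.5, 0.7, 0.2])
print(f"EER = {eer:.2%}")  # prints "EER = 25.00%"
```

On a finite test set the two error rates rarely cross exactly, so this sketch takes the nearest threshold; evaluation toolkits often interpolate along the ROC curve instead, which can give slightly different values.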
The authors also evaluated the models on audio samples collected from several unseen TTS systems: VoiceBox, VoiceFlow, NaturalSpeech 3, CMTTS, DiffProsody, and DiffAR. Models trained on the flow-matching (FM) based subsets of DFADD generalized better to these systems, which further underlines the dataset's value for developing more robust anti-spoofing detection models.