Key Concepts
The VoicePrivacy 2024 Challenge aims to develop voice anonymization systems that conceal speaker identity while preserving linguistic and emotional content in speech data.
Abstract
The VoicePrivacy 2024 Challenge is the third edition of a series of competitive benchmarking challenges focused on developing privacy preservation solutions for speech technology. The challenge task is to develop a voice anonymization system that conceals the speaker's identity while preserving the linguistic content and emotional states of the speech data.
The challenge provides development and evaluation datasets, evaluation scripts, baseline anonymization systems, and a list of training resources. Participants will apply their anonymization systems, run the evaluation scripts, and submit the results together with the anonymized speech data. The results will be presented at a workshop held in conjunction with Interspeech 2024.
Key changes from the previous 2022 edition include:
- Removal of the requirements to preserve voice distinctiveness and intonation; the associated gain of voice distinctiveness (GVD) and pitch correlation (ρF0) metrics are therefore no longer used.
- Provision of an extended list of datasets and pretrained models for training anonymization systems.
- Simplification of the evaluation protocol and reduction in the running time of the evaluation scripts.
- Use of objective evaluation only, with three complementary metrics: equal error rate (EER) as the privacy metric, and word error rate (WER) for automatic speech recognition (ASR) and unweighted average recall (UAR) for speech emotion recognition (SER) as the utility metrics (see the sketch after this list).
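As a concrete illustration of these three metrics, here is a minimal sketch using NumPy and scikit-learn. The function names and the input arrays are assumptions made for exposition; this is not the challenge's official evaluation code.

```python
# Illustrative implementations of the three objective metrics.
# Not the official VoicePrivacy evaluation scripts; inputs are hypothetical.
import numpy as np
from sklearn.metrics import recall_score, roc_curve


def equal_error_rate(labels: np.ndarray, scores: np.ndarray) -> float:
    """Privacy metric: EER of a speaker verification system.

    labels: 1 for same-speaker trials, 0 for different-speaker trials;
    scores: higher values mean 'more likely the same speaker'.
    """
    fpr, tpr, _ = roc_curve(labels, scores)
    fnr = 1.0 - tpr
    # The EER is the operating point where the false positive rate
    # and the false negative rate cross.
    idx = np.nanargmin(np.abs(fnr - fpr))
    return float((fpr[idx] + fnr[idx]) / 2.0)


def word_error_rate(reference: list[str], hypothesis: list[str]) -> float:
    """ASR utility metric: word-level Levenshtein distance / reference length."""
    d = np.zeros((len(reference) + 1, len(hypothesis) + 1), dtype=int)
    d[:, 0] = np.arange(len(reference) + 1)   # deletions only
    d[0, :] = np.arange(len(hypothesis) + 1)  # insertions only
    for i in range(1, len(reference) + 1):
        for j in range(1, len(hypothesis) + 1):
            sub = d[i - 1, j - 1] + (reference[i - 1] != hypothesis[j - 1])
            d[i, j] = min(sub, d[i - 1, j] + 1, d[i, j - 1] + 1)
    return d[-1, -1] / len(reference)


def unweighted_average_recall(y_true, y_pred) -> float:
    """SER utility metric: recall averaged over emotion classes with equal
    weight per class, i.e. macro-averaged recall."""
    return recall_score(y_true, y_pred, average="macro")
```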
The challenge involves four minimum target EER conditions (10%, 20%, 30%, 40%), and participants are encouraged to submit systems for as many conditions as possible. Within each EER interval, systems will be ranked separately in order of increasing WER and decreasing UAR, as sketched below.
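To make the ranking rule concrete, the sketch below filters hypothetical submissions by a minimum target EER and then orders the eligible ones by increasing WER and, separately, by decreasing UAR. The record fields and the eligibility check are assumptions for illustration, not the official ranking procedure.

```python
# Illustrative ranking within one minimum-target-EER condition.
# Submission records and the eligibility rule are hypothetical.
from dataclasses import dataclass


@dataclass
class Submission:
    name: str
    eer: float  # privacy metric, % (higher means stronger anonymization)
    wer: float  # ASR utility metric, % (lower is better)
    uar: float  # SER utility metric, % (higher is better)


def rank_condition(subs: list[Submission], min_eer: float):
    """Keep systems that reach the minimum target EER, then rank them
    separately by increasing WER and by decreasing UAR."""
    eligible = [s for s in subs if s.eer >= min_eer]
    by_wer = sorted(eligible, key=lambda s: s.wer)
    by_uar = sorted(eligible, key=lambda s: -s.uar)
    return by_wer, by_uar


# Example: two hypothetical systems evaluated for the 30% condition.
subs = [
    Submission("sys-A", eer=32.1, wer=4.8, uar=55.0),
    Submission("sys-B", eer=35.4, wer=5.6, uar=61.2),
]
by_wer, by_uar = rank_condition(subs, min_eer=30.0)
print("by WER:", [s.name for s in by_wer])  # sys-A first (lower WER)
print("by UAR:", [s.name for s in by_uar])  # sys-B first (higher UAR)
```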
Statistics
The automatic speaker verification (ASV) system is trained on 363.6 hours of speech from 921 speakers (439 female, 482 male) in the LibriSpeech train-clean-360 dataset.
The ASR system is trained on 960.9 hours of speech from 2,338 speakers (1,128 female, 1,210 male) in the full LibriSpeech train-960 dataset.
The SER system is trained on the IEMOCAP dataset, which contains 12 hours of speech from 10 speakers (5 female, 5 male).