The paper proposes a novel Profile-Error-Tolerant Target-Speaker Voice Activity Detection (PET-TSVAD) model that is robust to speaker profile errors introduced during first-pass diarization.
The key highlights are:
Existing TS-VAD models are sensitive to errors in the speaker profiles, because those profiles are typically obtained by running a traditional clustering-based diarization method as a first pass. PET-TSVAD is designed to address this issue.
PET-TSVAD extends the transformer-based TS-VAD architecture with a set of learnable pseudo-speaker profiles that handle speakers left undetected by first-pass diarization.
During training, PET-TSVAD uses speaker profiles estimated by multiple different clustering algorithms to reduce the mismatch between training and testing conditions.
PET-TSVAD adopts Permutation Invariant Training (PIT) to resolve the ambiguity in the output-to-reference mapping caused by speaker profile errors.
Experimental results show that PET-TSVAD consistently outperforms the existing TS-VAD models on both the VoxConverse and DIHARD-I datasets.
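To make the PIT highlight concrete, here is a minimal sketch of a permutation-invariant loss over per-speaker voice activity tracks. It is an illustration of the general technique, not the paper's implementation: the binary cross-entropy criterion, the exhaustive search over permutations, and the names `bce` and `pit_loss` are all assumptions made for this example.

```python
import itertools
import math


def bce(pred, ref, eps=1e-7):
    """Mean binary cross-entropy between one predicted and one reference track."""
    total = 0.0
    for p, r in zip(pred, ref):
        p = min(max(p, eps), 1.0 - eps)  # clamp to avoid log(0)
        total += -(r * math.log(p) + (1 - r) * math.log(1.0 - p))
    return total / len(pred)


def pit_loss(preds, refs):
    """Return (lowest loss, best permutation) over all output-to-reference mappings.

    preds: list of per-output activity probabilities, one list per output.
    refs:  list of per-speaker binary activity labels, same shape as preds.
    perm[i] is the index of the reference assigned to output i.
    Hypothetical helper: exhaustive search is only practical for a few speakers.
    """
    n = len(preds)
    best_loss, best_perm = float("inf"), None
    for perm in itertools.permutations(range(n)):
        loss = sum(bce(preds[i], refs[perm[i]]) for i in range(n)) / n
        if loss < best_loss:
            best_loss, best_perm = loss, perm
    return best_loss, best_perm


# Toy example: the two outputs are in the opposite order to the references.
preds = [[0.9, 0.8, 0.1, 0.2], [0.1, 0.2, 0.9, 0.7]]
refs = [[0, 0, 1, 1], [1, 1, 0, 0]]
loss, perm = pit_loss(preds, refs)
print(perm, round(loss, 3))
```

In this toy case the lowest-loss mapping swaps the two references, so the loss does not penalize the model for emitting the speakers in a different order than the labels. With more speakers, an assignment solver would be a common replacement for the exhaustive search.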