This paper proposes an approach to speaker diarization that jointly trains a single model to perform speaker embedding extraction, voice activity detection (VAD), and overlapped speech detection (OSD) simultaneously, achieving competitive performance with faster, more efficient inference than traditional modular systems.
This paper introduces LS-EEND, a novel frame-wise streaming end-to-end neural diarization model that excels in processing long-form audio and handling a flexible number of speakers by employing online attractor extraction and a multi-step progressive training strategy.
This research paper introduces a novel speaker diarization system using Mamba, a state-space model, within an end-to-end neural and clustering-based pipeline, demonstrating its superiority over traditional RNN and attention-based models in terms of accuracy and efficiency.
The proposed Profile-Error-Tolerant Target-Speaker Voice Activity Detection (PET-TSVAD) model is robust to speaker profile errors introduced in first-pass diarization and outperforms existing TS-VAD models on both the VoxConverse and DIHARD-I datasets.