Continuous Speech Separation and Transcription-Supported Diarization for Efficient Meeting Recognition

Core Concepts
A modular pipeline for single-channel meeting transcription that combines continuous speech separation, automatic speech recognition, and transcription-supported diarization to achieve state-of-the-art performance.
The proposed pipeline consists of the following key components:

- Continuous Speech Separation (CSS): Uses the TF-GridNet architecture to separate the input audio into two overlap-free streams, each of which may contain speech from multiple speakers. Extends TF-GridNet from fully overlapping utterances to long conversations via chunk-wise continuous processing and stitching.
- Automatic Speech Recognition (ASR): Employs the Whisper and ESPnet ASR models to transcribe the separated audio streams. The ASR output provides word and sentence boundary information to support diarization.
- Diarization: Applies a modular clustering-based diarization pipeline with voice activity detection (VAD), speaker embedding estimation, and k-means clustering. Introduces a novel sub-segmentation approach that exploits syntactic information from the ASR output to improve speaker turn detection: segments are first split at sentence boundaries and then refined further using word-level speaker embeddings.

Experiments on the Libri-CSS dataset show that the proposed pipeline achieves state-of-the-art performance in terms of Optimal Reference Combination Word Error Rate (ORC WER) and Concatenated Minimum-Permutation Word Error Rate (cpWER), outperforming prior work by a relative WER improvement of 20%. The results demonstrate the potential of the CSS-AD pipeline, which performs diarization after separation and ASR, and encourage further research in this direction.
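The chunk-wise continuous processing and stitching can be sketched as follows. This is a minimal illustration, not the paper's implementation: `separate_chunk` is a hypothetical stand-in for the TF-GridNet model, and resolving the channel permutation by correlating overlap regions is a common simplification of how chunk outputs are stitched into continuous streams.

```python
import numpy as np

def separate_chunk(chunk):
    """Stand-in for a two-output separation model such as TF-GridNet.
    For illustration it just splits the chunk equally across two streams."""
    return np.stack([0.5 * chunk, 0.5 * chunk])

def css_stitch(signal, chunk_len=16000, hop=8000):
    """Separate overlapping chunks and stitch the two output streams,
    resolving the channel permutation on each overlap region so that
    a speaker stays on a consistent output channel across chunks."""
    out = np.zeros((2, len(signal)))
    cnt = np.zeros(len(signal))
    for start in range(0, len(signal), hop):
        chunk = signal[start:start + chunk_len]
        sep = separate_chunk(chunk)  # shape: (2, len(chunk))
        # Number of leading samples already covered by earlier chunks.
        ov = int((cnt[start:start + len(chunk)] > 0).sum())
        if ov > 0:
            ref = out[:, start:start + ov] / cnt[start:start + ov]
            # Compare identity vs. swapped channel assignment on the overlap.
            score_id = (ref * sep[:, :ov]).sum()
            score_sw = (ref * sep[::-1, :ov]).sum()
            if score_sw > score_id:
                sep = sep[::-1]  # swap channels to keep speakers consistent
        out[:, start:start + len(chunk)] += sep
        cnt[start:start + len(chunk)] += 1
    return out / cnt  # overlap-add average
```

Because the dummy separator splits each chunk equally, the two stitched streams sum back to the input; with a real model, each stream would carry the overlap-free speech of a subset of speakers.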
The TF-GridNet separation model achieves an ORC WER of 6.8% on the Libri-CSS dataset, outperforming the no-separation baseline of 26.5% WER. Using the ESPnet ASR system further improves the ORC WER to 6.4%. The proposed transcription-supported diarization approach achieves a cpWER of 7.2%, which is a 20% relative improvement over prior work. With a second pass of the ESPnet ASR on the sub-segments, the cpWER is further improved to 6.2%.
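As a reminder of how the cpWER metric works: each speaker's references and hypotheses are concatenated, and the WER is minimized over all assignments of hypothesis speakers to reference speakers. The sketch below is a naive illustration (brute-force over permutations, equal speaker counts assumed), not the evaluation code used in the paper:

```python
from itertools import permutations

def wer(ref, hyp):
    """Word error rate via Levenshtein distance on word sequences."""
    r, h = ref.split(), hyp.split()
    d = [[0] * (len(h) + 1) for _ in range(len(r) + 1)]
    for i in range(len(r) + 1):
        d[i][0] = i
    for j in range(len(h) + 1):
        d[0][j] = j
    for i in range(1, len(r) + 1):
        for j in range(1, len(h) + 1):
            d[i][j] = min(d[i - 1][j] + 1,          # deletion
                          d[i][j - 1] + 1,          # insertion
                          d[i - 1][j - 1] + (r[i - 1] != h[j - 1]))
    return d[len(r)][len(h)] / max(len(r), 1)

def cpwer(refs, hyps):
    """Concatenated minimum-permutation WER: score every assignment of
    hypothesis speakers to reference speakers and keep the best one."""
    total_words = sum(len(r.split()) for r in refs)
    best = float("inf")
    for perm in permutations(range(len(hyps))):
        errors = sum(wer(refs[i], hyps[p]) * len(refs[i].split())
                     for i, p in enumerate(perm))
        best = min(best, errors / total_words)
    return best
```

In contrast, ORC WER optimizes the assignment of reference utterances to the separated output streams, so the two metrics penalize different kinds of errors.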
"The attractiveness of those integrated approaches is the training under a common objective function, which potentially delivers superior results."

"By performing separation first, we can greatly simplify the diarization system since it does not need to handle overlap."

"The idea of exploiting ASR for diarization was exploited in [12]. However, they did not integrate it into a full CSS pipeline since they did not use separation and thus could not handle overlapped speech."

Deeper Inquiries

How could the proposed pipeline be extended to handle more than two speakers in a segment?

To extend the proposed pipeline to handle more than two speakers in a segment, the Continuous Speech Separation (CSS) concept can be further developed. Currently, the TF-GridNet model is used to separate speech into two output channels, limiting the system to handling at most two speakers per segment. One approach to extend this capability is to explore advanced neural network architectures that can handle multiple speakers in a segment. This could involve modifying the separation model to output more than two channels, allowing for the separation of multiple speakers simultaneously. Additionally, incorporating speaker counting mechanisms or speaker permutation algorithms can help identify and separate the speech of each individual speaker within a segment, even in scenarios with multiple overlapping speakers.
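As an illustration of a simple speaker counting mechanism for an N-output separator, the sketch below keeps only the output channels whose energy lies within a threshold of the loudest one. The energy criterion and the 40 dB threshold are assumptions for illustration, not a method from the paper:

```python
import numpy as np

def count_active_streams(streams, threshold_db=-40.0):
    """Energy-based speaker counting: keep only output channels whose
    energy exceeds a threshold relative to the loudest stream."""
    energies = np.array([np.mean(s ** 2) for s in streams])
    ref = energies.max()
    # Relative level in dB; small epsilon avoids log of zero.
    keep = 10 * np.log10(energies / ref + 1e-12) > threshold_db
    return [s for s, k in zip(streams, keep) if k]
```

A real system would likely combine such a rule with the separation model itself (e.g. training the model to output silence on unused channels).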

What other types of syntactic information from the ASR system could be leveraged to further improve the diarization performance?

The diarization performance can be further improved by leveraging additional information from the ASR output. Apart from sentence boundaries, cues such as punctuation marks, as well as prosodic patterns (derived from the audio rather than the transcript) and semantic content, can be utilized to enhance the segmentation and clustering process. For example, commas or question marks can indicate potential speaker changes or pauses in speech, aiding the segmentation of the audio into speaker-homogeneous regions. Prosodic and semantic cues can reveal emphasis or topic transitions, which are valuable for identifying speaker turns and boundaries. By incorporating a richer set of features derived from the ASR output, the diarization module can attribute speech segments to the correct speakers more accurately.
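A two-pass sub-segmentation along these lines could be sketched as follows: a first split at sentence-final punctuation in the ASR output, then candidate speaker-change points from the similarity of consecutive word-level embeddings. Both functions and the 0.5 similarity threshold are hypothetical simplifications for illustration:

```python
import numpy as np

def split_at_sentence_boundaries(words):
    """First pass: cut wherever a word carries sentence-final punctuation
    emitted by the ASR system."""
    segments, current = [], []
    for w in words:
        current.append(w)
        if w[-1] in ".?!":
            segments.append(current)
            current = []
    if current:
        segments.append(current)
    return segments

def change_points(word_embs, threshold=0.5):
    """Second pass: flag positions where consecutive word-level speaker
    embeddings are dissimilar, suggesting a turn inside a sentence."""
    cuts = []
    for i in range(1, len(word_embs)):
        a, b = word_embs[i - 1], word_embs[i]
        cos = a @ b / (np.linalg.norm(a) * np.linalg.norm(b))
        if cos < threshold:
            cuts.append(i)
    return cuts
```

The resulting sub-segments would then be fed to the speaker embedding extractor and the clustering stage as in the proposed pipeline.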

How could the pipeline be adapted to work with multi-channel audio inputs to potentially achieve even better separation and diarization results?

Adapting the pipeline to work with multi-channel audio inputs can significantly enhance the separation and diarization results. By incorporating information from multiple audio channels, the system gains access to spatial cues and additional acoustic features that can improve the accuracy of speaker separation and clustering. One approach is to integrate multi-channel separation models, such as beamforming or spatial filtering techniques, to enhance the source separation process by leveraging the spatial characteristics of the audio signals. Additionally, multi-channel audio inputs can provide redundancy and diversity in the audio data, enabling more robust speaker embedding extraction and clustering. By combining information from multiple channels, the pipeline can achieve better separation of overlapping speech and more accurate diarization of multiple speakers in a meeting scenario.
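As a minimal example of exploiting spatial cues, a delay-and-sum beamformer shifts each microphone channel by its steering delay and averages, reinforcing the target direction while attenuating sources arriving from elsewhere. This is a generic illustration with integer-sample delays (note that np.roll wraps around at the edges), not a component of the proposed pipeline:

```python
import numpy as np

def delay_and_sum(channels, delays):
    """Delay-and-sum beamforming: undo each channel's steering delay
    (in samples) and average, so the target source adds coherently."""
    out = np.zeros(len(channels[0]))
    for ch, d in zip(channels, delays):
        out += np.roll(ch, -d)  # circular shift; fine for a sketch
    return out / len(channels)
```

A practical front-end would estimate the delays (or a full beamforming filter) from the data, e.g. with MVDR beamforming driven by time-frequency masks.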