Integrating spatial and spectral information through a combined statistical mixture model improves the accuracy of simultaneous diarization and speech separation for meeting transcription, outperforming cascaded approaches and reducing reliance on in-domain training data.
A modular pipeline for single-channel meeting transcription combines continuous speech separation, automatic speech recognition, and transcription-supported diarization to achieve state-of-the-art performance.