This work aims to build a robust audio-visual speaker diarization framework that handles content from diverse data domains and addresses challenges such as off-screen speakers, background noise, and domain mismatch.
LoCoNet interleaves Long-term Intra-speaker Modeling (LIM) and Short-term Inter-speaker Modeling (SIM): LIM captures the temporal dependencies of a single speaker, while SIM captures the interactions among speakers in the same scene. With this design, LoCoNet achieves state-of-the-art performance on multiple active speaker detection benchmarks.
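The interleaving of LIM and SIM can be illustrated with a toy sketch. This is not LoCoNet's implementation: it assumes features indexed as [speaker][frame][feature], stands in for LIM with parameter-free softmax attention over one speaker's full timeline, and stands in for SIM with an equal blend of each frame and a local cross-speaker mean. The names `lim`, `sim`, `interleave`, `radius`, and `num_blocks`, and the LIM-then-SIM order within a block, are all hypothetical choices made for illustration.

```python
import math
from typing import List

# Assumed layout: features[speaker][frame][feature_dim]
Tensor3 = List[List[List[float]]]


def lim(x: Tensor3) -> Tensor3:
    """Toy long-term intra-speaker step: every frame of a speaker attends
    to all frames of that same speaker; speakers never mix here."""
    out = []
    for track in x:  # each speaker is processed independently
        new_track = []
        for q in track:
            scores = [sum(a * b for a, b in zip(q, k)) for k in track]
            peak = max(scores)  # subtract the max for numerical stability
            weights = [math.exp(s - peak) for s in scores]
            total = sum(weights)
            new_track.append([
                sum(w * k[d] for w, k in zip(weights, track)) / total
                for d in range(len(q))
            ])
        out.append(new_track)
    return out


def sim(x: Tensor3, radius: int = 1) -> Tensor3:
    """Toy short-term inter-speaker step: blend each frame with the mean
    over all speakers inside a window of +/- radius frames."""
    n_frames, dim = len(x[0]), len(x[0][0])
    context = []
    for t in range(n_frames):
        lo, hi = max(0, t - radius), min(n_frames, t + radius + 1)
        neighbours = [track[u] for track in x for u in range(lo, hi)]
        context.append([
            sum(v[d] for v in neighbours) / len(neighbours)
            for d in range(dim)
        ])
    # Keep half of each speaker's own feature so identities do not collapse.
    return [
        [[0.5 * track[t][d] + 0.5 * context[t][d] for d in range(dim)]
         for t in range(n_frames)]
        for track in x
    ]


def interleave(x: Tensor3, num_blocks: int = 2, radius: int = 1) -> Tensor3:
    """Alternate the two steps; the LIM-then-SIM order is an assumption."""
    for _ in range(num_blocks):
        x = sim(lim(x), radius)
    return x
```

In this sketch the long-term step sees a speaker's entire timeline but no other speaker, while the short-term step sees every speaker but only a few neighbouring frames. Stacking the two lets information travel across both axes without a single step attending over all speakers and all frames at once.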