The paper proposes asymmetric and trial-dependent modeling approaches to address three challenges of Task 2 of the Short-duration Speaker Verification (SdSV) Challenge: short-duration utterances, language mismatch, and a mismatch between the enrollment and test data distributions.
Incorporating phonetic information into speaker verification models can improve performance by mitigating biases introduced by the underlying phoneme sequence in the speech signal.
Integrating contrastive learning on intermediate feature maps within a multi-scale feature aggregation architecture significantly improves speaker verification accuracy by enhancing the discriminative power of speaker embeddings.
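As a rough illustration of the idea in the last statement, the sketch below applies a contrastive loss to pooled intermediate feature maps at several scales and averages the result. It is not the implementation from the summarized work: the function names (`mean_pool`, `supervised_contrastive_loss`, `multi_scale_contrastive_loss`), the use of mean pooling over time, the supervised form of the contrastive loss, and the temperature of 0.1 are all assumptions chosen for a minimal, dependency-free example.

```python
import math


def mean_pool(feature_map):
    """Average a (frames x channels) feature map over time into one vector."""
    n_frames = len(feature_map)
    n_channels = len(feature_map[0])
    return [sum(frame[c] for frame in feature_map) / n_frames
            for c in range(n_channels)]


def l2_normalize(vector):
    """Scale a vector to unit length (left unchanged if it is all zeros)."""
    norm = math.sqrt(sum(x * x for x in vector)) or 1.0
    return [x / norm for x in vector]


def supervised_contrastive_loss(embeddings, speakers, temperature=0.1):
    """Pull same-speaker vectors together and push other speakers apart.

    Hypothetical helper: one common form of supervised contrastive loss,
    assumed here for illustration only.
    """
    z = [l2_normalize(e) for e in embeddings]
    n = len(z)
    total, n_anchors = 0.0, 0
    for i in range(n):
        positives = [j for j in range(n)
                     if j != i and speakers[j] == speakers[i]]
        if not positives:
            continue  # an anchor without a same-speaker partner is skipped
        # Scaled cosine similarity between the anchor and every other item.
        logits = {j: sum(a * b for a, b in zip(z[i], z[j])) / temperature
                  for j in range(n) if j != i}
        # Numerically stable log-sum-exp over all non-anchor items.
        peak = max(logits.values())
        log_denominator = peak + math.log(
            sum(math.exp(v - peak) for v in logits.values()))
        total += -sum(logits[j] - log_denominator
                      for j in positives) / len(positives)
        n_anchors += 1
    return total / n_anchors if n_anchors else 0.0


def multi_scale_contrastive_loss(feature_maps_per_scale, speakers,
                                 temperature=0.1):
    """Average the contrastive loss over every intermediate scale.

    feature_maps_per_scale[s][u] is the feature map of utterance u at
    scale s (an assumed layout, not taken from the summarized work).
    """
    losses = []
    for batch_maps in feature_maps_per_scale:
        pooled = [mean_pool(fm) for fm in batch_maps]
        losses.append(
            supervised_contrastive_loss(pooled, speakers, temperature))
    return sum(losses) / len(losses)
```

In a training setup, a term like this would typically be added to the main speaker-classification loss with a weighting factor; the weighting and the choice of which layers to tap are design decisions that the summary above does not specify.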