Core Concepts
The paper proposes the Fuse after Align (FAA) framework, which uses a multimodal encoder to learn cross-modal relations between faces and voices. It also introduces a pair selection method that increases the diversity and difficulty of training samples, improving performance on voice-face matching, verification, and retrieval tasks.
Abstract
The paper addresses the problem of learning the association between voice and face, which is important for applications like virtual reality, criminal investigations, and multimodal information retrieval. Previous works have relied on cosine similarity or L2 distance to evaluate the likeness of voices and faces, an approach that treats the embeddings merely as high-dimensional vectors and exploits only a small part of the available cross-modal information.
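For context, the baseline metrics the paper moves beyond can be computed in a few lines; the 512-dim embeddings below are hypothetical placeholders, not the paper's encoder outputs.

```python
import torch
import torch.nn.functional as F

# Hypothetical 512-dim face and voice embeddings (the actual
# dimensionality depends on the encoders used).
face = torch.randn(512)
voice = torch.randn(512)

cos_sim = F.cosine_similarity(face, voice, dim=0)  # higher = more alike
l2_dist = torch.dist(face, voice, p=2)             # lower = more alike
```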
The key contributions of the paper are:
- Proposing a multimodal encoder that models cross-modal relations directly, rather than only optimizing a similarity metric between unimodal embeddings.
- Employing a mixed training objective, combining modality alignment through contrastive learning and direct cross-modality learning through face-voice matching.
- Introducing an effective pair selection method, combining diverse positive pair selection and hard negative mining, to enhance the diversity and difficulty of training samples (a sketch follows this list).
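As a rough illustration of the idea (not the paper's exact selection rule), a common in-batch scheme picks, for each anchor, the least similar positive and the most similar negative:

```python
import torch

def select_pairs(sim, labels):
    """In-batch pair selection (illustrative, not the paper's exact rule).
    sim: (B, B) face-to-voice similarities; labels: (B,) identity ids.
    Returns, per anchor, a diverse positive (least similar sample sharing
    the identity) and a hard negative (most similar different identity)."""
    same = labels.unsqueeze(0) == labels.unsqueeze(1)   # (B, B) positive mask
    div_pos = sim.masked_fill(~same, float("inf")).argmin(dim=1)
    hard_neg = sim.masked_fill(same, float("-inf")).argmax(dim=1)
    return div_pos, hard_neg
```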
The paper first describes the dual-modality pooling and progressive clustering techniques used to obtain pseudo-labels for the unlabeled data.
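The clustering step is only outlined here, so the following is a minimal sketch under stated assumptions: mean-pool each video's face frames and audio clips, fuse the pooled embeddings, and treat cluster assignments as pseudo-identities. k-means stands in for the paper's progressive clustering, which would repeat this while refining the number of clusters.

```python
import numpy as np
from sklearn.cluster import KMeans

def pseudo_labels(face_emb, voice_emb, n_clusters):
    """Illustrative pseudo-labeling. face_emb: (videos, frames, d),
    voice_emb: (videos, clips, d). k-means and mean pooling are
    assumptions standing in for the paper's progressive clustering."""
    f = face_emb.mean(axis=1)                 # pool face frames per video
    v = voice_emb.mean(axis=1)                # pool audio clips per video
    joint = np.concatenate([f, v], axis=1)    # dual-modality fusion
    joint /= np.linalg.norm(joint, axis=1, keepdims=True)  # L2-normalize
    return KMeans(n_clusters=n_clusters, n_init=10).fit_predict(joint)
```

The paper then details the two training objectives (a combined sketch follows the list):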
- Face-Voice Contrastive Learning: Uses the multi-similarity loss to learn better unimodal representations before fusion.
- Face-Voice Matching: Trains the multimodal encoder to predict whether a face-voice pair belongs to the same identity or not.
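The sketch below combines the two objectives. The multi-similarity loss follows the standard formulation of Wang et al. (2019); the hyperparameter defaults and the binary matching head are illustrative assumptions rather than the paper's reported settings.

```python
import torch
import torch.nn.functional as F

def multi_similarity_loss(sim, pos_mask, alpha=2.0, beta=50.0, lam=0.5):
    """Multi-similarity loss over a (B, B) face-voice similarity matrix.
    pos_mask[i, j] is True when face i and voice j share an identity.
    alpha, beta, lam are illustrative defaults, not the paper's values."""
    pos = pos_mask.float()
    neg = (~pos_mask).float()
    pos_term = torch.log1p((torch.exp(-alpha * (sim - lam)) * pos).sum(1)) / alpha
    neg_term = torch.log1p((torch.exp(beta * (sim - lam)) * neg).sum(1)) / beta
    return (pos_term + neg_term).mean()

def matching_loss(fused_logits, is_match):
    """Binary face-voice matching: the multimodal encoder's fused output
    is scored as match (1) / non-match (0) with cross-entropy."""
    return F.binary_cross_entropy_with_logits(fused_logits, is_match.float())

# Hypothetical mixed objective: align the modalities, then fuse and match.
# total = multi_similarity_loss(sim, pos_mask) + matching_loss(logits, labels)
```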
The experiments show that the proposed FAA framework outperforms previous state-of-the-art methods in voice-face matching, verification, and retrieval tasks, demonstrating the effectiveness of the multimodal encoder and the pair selection approach.
Stats
The dataset used for training and evaluation is the VoxCeleb dataset, which contains 153K audio clips and 1.2M face images of 1,225 identities.
The training, validation, and test sets contain 16,650, 2,045, and 3,428 videos, respectively.