Improving Face-Voice Association Learning through Multimodal Encoding and Effective Pair Selection
The paper proposes the Fuse after Align (FAA) framework, which uses a multimodal encoder to learn cross-modal relations between faces and voices. It also introduces a pair selection method that increases the diversity and difficulty of training samples. Together, these improve performance on voice-face matching, verification, and retrieval tasks.
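
To make the idea of selecting difficult training pairs concrete, here is a minimal sketch of one common approach: hard-negative mining by cosine similarity. This is an illustration under stated assumptions, not the paper's selection method, whose criterion is not described in this summary. The function name `select_hard_negatives`, the parameter `k`, and the convention that row i of each embedding matrix belongs to the same identity are all hypothetical.

```python
import numpy as np


def l2_normalize(x, eps=1e-12):
    """Scale each row to unit length so dot products become cosine similarities."""
    return x / np.maximum(np.linalg.norm(x, axis=1, keepdims=True), eps)


def select_hard_negatives(face_emb, voice_emb, k=2):
    """For each face i, return the indices of the k most similar voices j != i.

    Assumption (not from the paper): row i of face_emb and row i of voice_emb
    come from the same identity, so the diagonal holds the positive pairs.
    """
    f = l2_normalize(face_emb)
    v = l2_normalize(voice_emb)
    sim = f @ v.T                    # (N, N) face-to-voice cosine similarities
    np.fill_diagonal(sim, -np.inf)   # mask positives so they are never selected
    return np.argsort(-sim, axis=1)[:, :k]  # most similar mismatched voices first


# Toy embeddings standing in for real encoder outputs.
rng = np.random.default_rng(0)
faces = rng.normal(size=(5, 8))
voices = rng.normal(size=(5, 8))
hard = select_hard_negatives(faces, voices, k=2)
print(hard.shape)  # (5, 2)
```

Negatives chosen this way are the mismatched voices the current embeddings find hardest to tell apart from the true match, which is one way to raise sample difficulty. Promoting diversity would need an additional criterion, for example sampling among the top candidates rather than always taking the first k.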