The author explores a recursive fusion model for audio-visual person verification, designed to capture both intra-modal and inter-modal relationships.